A STUDY OF COMPUTER-ADMINISTERED STRADAPTIVE ABILITY TESTING



In the early years of mental measurement, tests of individual differences
were designed for individuals rather than groups. Binet's intelligence
test, for example, was tailored to the individual. Using the Binet approach,
the examiner neither wasted time administering items which were too easy
for a testee nor frustrated the testee with items which were much too difficult.
But the expense of the individual test administration required by Binet's
approach forced test makers to devise an alternative measurement strategy
that required less administration time by a trained psychometrist. The
result was the group test.

In the process of applying group tests to the measurement of
individuals, many of the advantages of individualized testing were sacrificed
for the greater efficiency possible by measuring large groups of individuals
at one time. The result was tests which were too difficult for some testees
and too easy for others, with measurement accuracy that varied widely as
a function of ability level. Although group tests measured well the average
member of the population for which they were constructed, there was still
room for considerable improvement.

The advent of interactive computers provided an economical path for
a return to individualized testing. With their development came the means
to construct tests to efficiently measure individuals who were not
necessarily typical of certain populations. A variety of techniques have been
proposed for administering ability tests on interactive computer systems.
Weiss and Betz (1973) summarize the recent literature on adaptive, or tailored,
testing.

A basic premise of adaptive testing is that the best test for
measuring an individual is a test with item difficulties peaked at the ability
level of that individual rather than at the mean ability of a population.
The fact that ability is not known until the end of the test has resulted
in a diversity of strategies for choosing the items to be administered to
a given individual (Weiss, 1974). These strategies can be divided into two
subcategories: two-stage and multi-stage strategies. The latter are of the
most interest here and can be further divided into two subclasses: variable
branching procedures and fixed branching procedures.

Variable branching procedures include Bayesian and maximum likelihood
approaches. Bayesian strategies, such as those proposed by Novick (1969)
and Owen (1969), may begin with some initial estimate of the testee's ability,
such as grade-point average or the testee's own subjective ability estimate.
Given this ability estimate, every available item in the item pool is examined.
Then, on the basis of the guidelines set by the particular model in use,
the best item is chosen to be administered. Given the response to that item
and the initial ability estimate, a new ability estimate is calculated and the
procedure is repeated. The test usually terminates when a desired degree of
precision of measurement is reached. Bayesian strategies have as their
advantages the capability of using prior information about the ability of
an individual, the tailoring of test length as well as difficulty, and, by
examining all available items at each stage, the capacity to make very efficient
use of an item pool. Their disadvantage may lie in the failure of real items
and real individuals to meet the crucial assumptions on which the Bayesian
models are based. In addition, the computing time required to search a large
item pool might lower their utility in interactive testing on small computer
systems.

The fixed branching procedures use a set of items which are pre-structured
by difficulty and/or discrimination. In these strategies, at any stage in
the test a testee is branched to one predetermined category of items (which
may consist of a single item) if he answers the item correctly, or to another
predetermined category of items if he answers incorrectly. Because branching
from any item is dependent only on the response to that item, the item pool
does not need to be searched after each item response.

Most fixed branching procedures are variations of the pyramidal testing
strategy (e.g., Larkin & Weiss, 1974). A pyramidal test has its items arranged
in a triangular or pyramidal structure with item difficulties at the peak
centered on the mean ability of the population of individuals to be measured
(see Weiss, 1974, pp. 12-36). Difficulties increase or decrease with
distance to the right or left of the peak. An individual taking a test under
this strategy is first administered the item at the peak. If he answers it
correctly, he is branched to the more difficult of the two items in the second
stage; if he answers it incorrectly, he is branched to the less difficult item.
This process continues until the testee reaches the end of a fixed number
of stages.

Since each stage of a pyramidal test requires a number of items equal
to the number of that stage, the pyramidal test requires a substantial number
of items (n(n+1)/2 for an n-stage test). Furthermore, the pyramidal test is
very inefficient in its use of available items. A pyramidal test has items
at a number of difficulty levels. With the exception of the individual who
answers all items correctly or all items incorrectly, a testee enters most
difficulty levels somewhere after the first item at that difficulty level.
Consequently, all the preceding items at that difficulty level are not used.
This is a problem in any real operationalization of the pyramidal strategy
because there is no good position in the structure to put the most
discriminating items. At no point in the structure will these items be routinely
administered to testees whose sequence of item responses requires items to be
administered at a given difficulty level.
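
The item cost of the pyramidal structure can be illustrated with a minimal Python sketch. The function names and the (stage, position) indexing are assumptions introduced here for illustration; they are not taken from the studies cited above.

```python
def pyramid_item_count(n_stages: int) -> int:
    """Stage k holds k items, so an n-stage pyramid needs 1 + 2 + ... + n items."""
    return n_stages * (n_stages + 1) // 2


def pyramid_path(responses):
    """Return the (stage, position) of each item a testee is administered.

    Hypothetical indexing: position 0 is the easiest item within a stage.
    A correct response moves one position toward the harder side of the
    next stage; an incorrect response keeps the same position, which is
    the easier of the two items reachable in the next stage.
    """
    position = 0
    path = []
    for stage, correct in enumerate(responses, start=1):
        path.append((stage, position))
        if correct:
            position += 1
    return path


# A 10-stage pyramid stores 55 items, yet each testee answers only 10 of them.
print(pyramid_item_count(10))
print(pyramid_path([True, False, True]))
```

Under these assumptions the gap between items stored and items administered grows quadratically with the number of stages, which is the inefficiency described above.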

Thus, the Bayesian strategies are promising because of their use of
prior information, optimal branching, item economy, and flexible termination.
But it remains to be seen whether the assumptions on which such strategies
are based will be sufficiently met by real items and real individuals to
realize an advantage in utility. The pyramidal strategy has as its advantage
the lack of restrictive assumptions needed by the Bayesian strategies but
lacks all the advantages of the Bayesian strategies: it makes no use of
prior information, its termination criterion is inflexible, and it makes
very inefficient use of an item pool. Clearly, some compromise approach is
called for.
Such a strategy was proposed by Weiss (1973) and was named the stratified
adaptive, or stradaptive, ability test. The stradaptive test is a collection
of short peaked ability tests, each of these tests being referred to as a
stratum. These strata are ordered by difficulty and are equally spaced along
the ability continuum. Items within each stratum are ordered with the most
discriminating items appearing first. Beginning with any rough ability
estimate, a testee can begin the test in any stratum and is administered the
most discriminating item in that stratum. On the basis of his response, the
testee is branched either to a stratum with more difficult items, or to one
with easier items, and is administered the most discriminating item in the
chosen stratum. This process continues, with the testee being administered
the most discriminating item yet unadministered in each stratum, until some
termination criterion is reached. One termination criterion for the stradaptive
test is based on a criterion borrowed from Binet. Its goal is to locate the
level of chance responding, and termination occurs once this "ceiling level"
is reliably located.

The stradaptive strategy bears some similarities to the Markov process
with a reflecting barrier proposed by Mussio (1973), which was essentially a
truncated pyramidal test. The stradaptive test is different in that it lacks
Mussio's formal item structure, thus allowing better item economy, and lacks
the common entry point and fixed number of items administered, which are
characteristic of the pyramidal strategies.

The stradaptive test lacks the optimal branching of the Bayesian
strategies but retains their advantages of utilization of prior information,
tailored termination, and efficient use of the item pool. Its further
advantage is that it does not require the restrictive assumptions on which the
Bayesian strategies rest.

Waters (1974, 1975) reported the results of a study of live stradaptive
ability testing. Using a pool of 250 verbal analogy items obtained from
Educational Testing Service, he administered 46 conventional tests and 53
stradaptive tests to college students. His design allowed for the
computation of both parallel forms reliability and validity coefficients.
Validity was operationalized as a correlation between scores on his tests and
scores on a conventional test composed of similar items taken earlier. His
major findings were that 1) the stradaptive strategy was able to attain parallel
forms reliabilities and validities comparable to a conventional test having
twice as many items; 2) the reliability and validity of the stradaptive
scores were strongly dependent on the termination criterion used; and 3) some
methods of scoring the stradaptive test gave higher validities and reliabilities
than other scoring methods, with the average difficulty of all items answered
correctly consistently being one of the highest.

The present paper reports on the administration of two different
stradaptive tests to college students to study the stradaptive strategy's
psychometric characteristics, using an item pool and evaluative criteria
different than those used by Waters. Further details on the logic and
rationale of stradaptive testing are given in Weiss (1973).



METHOD 

Design

This study was part of a larger research program studying the utility
of computerized ability testing. One goal of the program is to determine the
empirical relationships between ability estimates derived from the various
adaptive testing strategies, as well as their relationships with ability
estimates derived from a conventional test. In addition, strategies of
adaptive testing are being evaluated in terms of other psychometric
characteristics, in an attempt to identify those strategies which are most
promising for practical applications. As part of that program, this study
investigated the stradaptive testing strategy.

A 40-item conventional vocabulary test and two forms of a stradaptive
vocabulary test were administered to college students. The two stradaptive
forms differed in that one counted question mark responses (i.e., omitted
items) as incorrect and the other ignored items responded to with question
marks.

All tests were presented using Datapoint 3000 cathode-ray terminals
(CRTs) acoustically coupled to a Control Data Corporation 6400 time-shared
computer. The testee responded on the CRT keyboard to each item presented with
either a number indicating the multiple-choice alternative chosen or a question
mark if he did not know the answer and chose not to guess. (See DeWitt and
Weiss, 1974, for details of the test administration software.)

This study was concerned with two major kinds of analyses. First, data
from the three tests were analyzed in terms of the characteristics of their
score distributions, the correlations between stradaptive and conventional
tests, and the magnitudes of their test-retest stabilities. Second,
characteristics of the stradaptive tests were investigated to provide a basis
for refinement of the strategy. Among the characteristics investigated
were the intercorrelations among the many methods of scoring a stradaptive
test. This was done to determine which scores were redundant and could be
eliminated. The utilities of the consistency scores in predicting
test-retest stability were also investigated. To provide data for future
development of subject characteristic curves (described below), stability
of stradaptive test response records was investigated. The impact of ignoring
omitted items was evaluated in terms of relative test-retest stabilities
of scores derived from the two forms of stradaptive tests, and in terms of
the relative difficulties of items giving rise to question mark responses.
Finally, to evaluate the adequacy of the item pool (i.e., the effect of
having many highly discriminating easy items but few highly discriminating
difficult items), test scores were correlated with test length.

Implementation of the Stradaptive Testing Strategy

Item Structure

For this study, two forms of the stradaptive test were prepared and will
be referred to as Stradaptive 1 and Stradaptive 2. Stradaptive 1 is the
stradaptive test used for illustrative purposes by Weiss (1973). It
consisted of 229 vocabulary items taken from a larger pool of 369 items,
with the restriction that items not overlap with those of a conventional
test constructed for purposes of comparison. The larger pool was described
by McBride and Weiss (1974), and norming item statistics for the 229 items
used here are given in Appendix Table A-1.

Summary statistics for the items in both forms of the stradaptive test
are given in Table 1. As is shown, the items in Stradaptive 1 were grouped
into nine strata with stratum 5 centered on a normal ogive difficulty
(Lord and Novick, 1968, pp. 376-378) of b=.007. The width of each stratum,
and distance between the means of successive strata, was about 0.65 normal
ogive difficulty units.

Stradaptive 2 consisted of 269 vocabulary items. This set of items
was composed of most of the original 229 items of Stradaptive 1, the 40
items which were originally used in the conventional test, and a few new items.
As can be seen from Table 1, the item structure of Stradaptive 2 was quite
similar to that of Stradaptive 1, both consisting of items arranged in nine
strata spaced about 0.65 difficulty units apart. Norming item statistics
for Stradaptive 2 are also presented in Appendix Table A-1.

The most important distinction between the two stradaptive tests is
the manner in which question mark responses (i.e., omitted items) were
handled. In Stradaptive 1, a question mark was treated as an incorrect
response. It caused the testee to be branched down one stratum, and was
counted as incorrect when the scores were calculated. To investigate the effect
of not penalizing the testee for answering honestly when he was not sure of
the correct answer, question mark responses were ignored in Stradaptive
2. The subject was administered the next item in the same stratum (i.e.,
branched neither up nor down) and the item to which he responded with a
question mark was not included in the calculation of scores.

The entry point or stratum in which the test was begun was determined
for each testee using his reported grade-point average. The display presented
on the CRT screen to the testee for this purpose, along with the entry
stratum resulting from his response (which, of course, was not on the CRT
screen) is shown in Figure 1.

Several branching rules were discussed by Weiss (1973) with respect to
the stradaptive strategy and have been considered in discussions of other
adaptive testing strategies (e.g., Weiss and Betz, 1973; Larkin and Weiss,
1974). The technique used here was the simple up-one, down-one branching
rule. A testee was branched to the first unanswered item at the next more
difficult stratum following a correct response, and to the first unanswered
item at the next easier stratum following an incorrect response. The exception
to this rule was when the testee gave a correct response to an item in the
most difficult stratum or an incorrect response to an item in the least
difficult stratum. In those instances, the testee was branched to the next
item in the same stratum.
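
The up-one, down-one rule, including the boundary exception and the two treatments of question mark responses, can be sketched in Python as follows. This is a minimal illustration rather than the test administration software used in the study; the function names, the response labels, and the `ignore_omits` flag are assumptions introduced here.

```python
def next_stratum(current, response, n_strata=9, ignore_omits=False):
    """Return the stratum from which the next item is drawn.

    response is one of "correct", "incorrect", or "omit" (a question mark).
    ignore_omits=False mimics Stradaptive 1 (an omit counts as incorrect);
    ignore_omits=True mimics Stradaptive 2 (an omit leaves the stratum unchanged).
    """
    if response == "omit":
        if ignore_omits:
            return current
        response = "incorrect"
    if response == "correct":
        # Up one stratum, except that the most difficult stratum is a boundary.
        return min(current + 1, n_strata)
    # Down one stratum, except that the least difficult stratum is a boundary.
    return max(current - 1, 1)


def trace_strata(entry, responses, n_strata=9, ignore_omits=False):
    """List the stratum of each administered item, plus the stratum that
    the item after the last response would be drawn from."""
    strata = [entry]
    for response in responses:
        strata.append(next_stratum(strata[-1], response, n_strata, ignore_omits))
    return strata


# Entry at stratum 5, three correct responses, then one incorrect response.
print(trace_strata(5, ["correct", "correct", "correct", "incorrect"]))
```

Within the chosen stratum, the item administered would be the first one not yet given; that bookkeeping is omitted from the sketch.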
Table 1

Summary Statistics for Conventional and Normal Ogive Item Parameters for
Two Stradaptive Tests, by Stratum

(N = number of items; p = proportion correct; r_bis = biserial with total
score; b = normal ogive difficulty; a = normal ogive discrimination.
[?] marks an entry that is illegible or missing in the source.)

                                 Stradaptive 1                |             Stradaptive 2
Stratum  Statistic        N       p   r_bis       b       a  |       N       p   r_bis       b       a

[Rows for Strata 1 through 3 are missing in the source.]

4        Mean           [?]    .631    .506   -.633    .648  |     [?]    .632    .499   -.565    .617
         S.D.                  .055    .116    .192     [?]  |            .044    .098    .178    .236
         High                  .731    .680   -.343   1.822  |            .731    .680   -.343   1.822
         Low                   .542    .259   -.998    .268  |            .542    .288   -.998    .301

5        Mean           [?]     [?]    .558    .007    .652  |     [?]    .503     [?]   -.020    .643
         S.D.                  .043    .141    .189    .270  |            .042     [?]    .196    .220
         High                  .568    .794    .329   1.306  |            .568    .794    .329   1.306
         Low                   .427    .331   -.285    .317  |            .427    .331   -.319    .317

6        Mean           [?]    .382    .469    .651    .549  |      22    .379    .455    .695    .527
         S.D.                  .036    .106    .185    .177  |            .036    .107    .206    .175
         High                  .434    .700    .977    .980  |            .434    .700    .977    .980
         Low                   .305    .346    .337    .369  |            .305    .296    .337     [?]

7        Mean           [?]    .295    .436   1.327    .456  |      25    .296    .437     [?]    .460
         S.D.                  .039    .083    .183    .113  |            .059    .084    .182    .115
         High                  .353    .618   1.630    .718  |            .353    .618   1.630    .718
         Low                   .217    .323   1.004    .312  |            .217    .323   1.004    .312

8        Mean            15    .200    .427   2.096    .482  |      15    .200    .427   2.006    .482
         S.D.                  .047    .087    .206    .133  |            .047    .087    .206    .133
         High                  .274    .648   2.313    .851  |            .274    .648   2.313    .851
         Low                   .110    .321   1.649    .339  |            .110    .321   1.649    .339

9        Mean            10    .168    .387   2.621     [?]  |      10    .168    .387   2.621    .427
         S.D.                  .069    .103    .273    .163  |            .069    .163    .273    .163
         High                  .300    .643   3.113    .840  |            .300    .643   3.113    .840
         Low                   .029    .253   2.320    .214  |            .029    .253   2.320    .214

Figure 1

Stradaptive Test Entry Point Question

                                                      Entry Stratum
IN WHICH CATEGORY IS YOUR CUMULATIVE GPA TO DATE?     (not seen
                                                      by student)

     1.   3.76 to 4.00 ...............................9
     2.   3.51 to 3.75 ...............................8
     3.   3.26 to 3.50 ...............................7
     4.   3.01 to 3.25 ...............................6
     5.   2.76 to 3.00 ...............................5
     6.   2.51 to 2.75 ...............................4
     7.   2.26 to 2.50 ...............................3
     8.   2.01 to 2.25 ...............................2
     9.   2.00 OR LESS ...............................1

ENTER THE CATEGORY (1 THROUGH 9) AND PRESS THE RETURN KEY.





Scoring the Stradaptive Test



Ceiling and basal strata. Several methods of scoring the stradaptive
test require the use of ceiling and basal strata. These two concepts were
borrowed from individual intelligence testing, primarily the Binet test.
The basal level of responding is that difficulty level of items at which
the testee answers all items correctly. The use of the basal level assumes
that all less difficult items would also be answered correctly, and,
therefore, easier items are not administered once a basal level has been
established. The basal stratum was defined for use in the stradaptive test
by Weiss (1973) as the most difficult stratum at which all items were answered
correctly.

In the present data, if such a stratum existed, it was identified as
basal. If no stratum existed in which all items were answered
correctly but at least one item was administered at the least difficult
stratum, it was assumed that all strata were too difficult to be called
basal and the hypothetical stratum below the lowest actual stratum was
taken as basal. All other conditions (e.g., the test was incomplete,
there was no identifiable basal stratum, and [...] criterion had been
reached) were considered abnormal terminations and the subject was eliminated.
Most abnormal terminations were [...], although a few
were caused by subjects [...].

The ceiling level of responding is that level of difficulty at which
the testee answers no items correctly. The use of the ceiling
stratum assumes that he would answer no items correctly at any level of
greater difficulty. Consequently, more difficult items are not administered.
In the case of multiple-choice items, the testee is likely to answer some
items correctly due solely to chance successes. These chance successes
are most likely to occur on items which are too difficult for a given testee.
Thus, for multiple-choice items, the ceiling level can be defined as that
level of difficulty where the testee answers correctly no more items than
would be expected from random guessing.

In this study, which used five-alternative multiple-choice items, the
ceiling stratum was defined as the least difficult stratum in which five
or more items were administered and the testee answered 20% or less correctly.
The five item minimum was established to allow a reasonably stable estimate
of proportion correct at a given stratum for the critical termination
criterion. If such a stratum existed, it was identified as the ceiling stratum.
If no such stratum existed, but all items at the most difficult stratum
had been administered, the hypothetical stratum immediately above the most
difficult stratum was taken as the ceiling stratum. All other conditions
were considered abnormal terminations and the testee was eliminated from
all analyses.
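
The ceiling and basal definitions can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the response record is represented as a list of hypothetical (stratum, correct) pairs, the function names are introduced here, and the fallbacks to hypothetical strata and the abnormal-termination checks described above are deliberately left out (the functions return None in those cases).

```python
from collections import defaultdict


def tally(record):
    """Map each stratum to [items administered, items correct]."""
    counts = defaultdict(lambda: [0, 0])
    for stratum, correct in record:
        counts[stratum][0] += 1
        counts[stratum][1] += int(correct)
    return counts


def ceiling_stratum(record, min_items=5, chance=0.20):
    """Least difficult stratum with at least min_items administered and a
    proportion correct at or below the chance level; None if there is none."""
    eligible = [
        stratum
        for stratum, (n, k) in tally(record).items()
        if n >= min_items and k <= chance * n + 1e-9
    ]
    return min(eligible) if eligible else None


def basal_stratum(record):
    """Most difficult stratum at which every administered item was answered
    correctly; None if there is none."""
    perfect = [stratum for stratum, (n, k) in tally(record).items() if k == n]
    return max(perfect) if perfect else None


# Hypothetical record: five misses at stratum 8, mixed results at 7, one hit at 6.
record = [(8, False)] * 5 + [(7, True), (7, False), (6, True)]
print(ceiling_stratum(record), basal_stratum(record))
```

With five-alternative items the chance level is 1 in 5, which is why the default threshold is 0.20; a different item format would call for a different value.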

Ability level scores. Weiss (1973) proposed ten methods of scoring the
stradaptive test to obtain ability level estimates. These ability level
scores are referred to by number in the figures and tables throughout this
report. Score numbers and brief descriptions are shown together in the
sample stradaptive test report shown in Figure 2.

Scores 1 through 3 are item difficulty scores. These scoring methods
are borrowed from the pyramidal testing strategy (see Larkin and Weiss,
1974; Weiss, 1974, pp. 12-36). Score 1 is the difficulty of the most
difficult item answered correctly. With the exception of abnormal terminations,
this score could always be determined and was used as defined.

Score 2 is the difficulty of the (N+1)th item, or the next item that would
have been administered had testing continued beyond termination. This score
could not be determined in two circumstances. First, if termination was
caused by running out of items in the next stratum to be drawn from, there
obviously was no item from which to determine the score. Second, if the
Nth item was in the highest stratum and the response was correct, or the
Nth item was in the lowest stratum and the response was incorrect, the
(N+1)th item would be chosen from a stratum that did not exist (i.e., a
hypothetical stratum). In these cases, the effect would be the same as
in the first situation (i.e., there would be no item from which to determine
the score). In the first situation, where there was an insufficient number
of items in an existing stratum, the average difficulty of the items in
the stratum (the stratum difficulty) was substituted as the testee's score.
In the second case, difficulties of hypothetical strata .65 units above
the most difficult existing stratum or .65 units below the least difficult
stratum were used as the testee's score.

Score 3 was defined as the most difficult non-chance item answered
correctly. This was determined from the difficulty of the most difficult
item answered in the stratum immediately below the testee's ceiling stratum.
This item existed, and thus the score could be determined, except in the
condition where the ceiling stratum was stratum 1, the lowest actual
stratum. In this case the difficulty of the lower hypothetical stratum
was used.

Scores 4 through 6 can be referred to as stratum scores. These
are stratum difficulty analogues to the three item difficulty scores.
Score 4 was defined as the mean difficulty of the items at the most
difficult stratum in which at least one item was answered correctly.
Score 5 was defined as the mean difficulty of the stratum containing the
(N+1)th item (or hypothetical item if no item existed). Score 6 is the
mean difficulty of the highest non-chance stratum or the stratum immediately
below the ceiling stratum. These three scores, barring abnormal termination,
were always determinable and were implemented as defined.

Score 7, the interpolated stratum difficulty score, was an attempt
to determine the exact stratum difficulty at which the testee would respond
at a chance level when that difficulty fell between two available stratum
difficulties. Algebraically, it was defined as:

$$A = D_{c-1} + S\,(P_{c-1} - .50)$$

where: $D_{c-1}$ is the average difficulty of the $(c-1)$th stratum and
$c$ is the ceiling stratum. It is therefore the average
difficulty of all items available at the testee's highest
non-chance stratum, or the stratum just below his ceiling
stratum.

$P_{c-1}$ is the testee's proportion correct at the $(c-1)$th stratum

and $S$ is $D_c - D_{c-1}$ if $P_{c-1}$ is greater than .50, or
$D_{c-1} - D_{c-2}$ if $P_{c-1}$ is less than .50, where $D$ is the
average difficulty of the designated stratum.

It was possible to calculate this score except in the condition where the
ceiling stratum was stratum 1. In that case, no proportion correct was
available for the "c-1" stratum and the score could not be calculated.
In this study, this particular condition never occurred. Thus, with the
exception of abnormal terminations, score 7 was determinable for all
testees.
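
The interpolated stratum difficulty score can be worked through with a short Python sketch. The function name, the dictionary of stratum means, and the numerical values below are illustrative assumptions, not values from Table 1; the sketch also assumes that the strata c, c-1, and c-2 all have a defined mean difficulty, so hypothetical strata are not handled.

```python
def score7(mean_difficulty, ceiling, p_below):
    """Interpolated stratum difficulty: D[c-1] + S * (P[c-1] - .50).

    mean_difficulty maps stratum number to its average item difficulty,
    ceiling is the ceiling stratum c, and p_below is the testee's
    proportion correct at stratum c-1.
    """
    d_below = mean_difficulty[ceiling - 1]
    if p_below > 0.50:
        spacing = mean_difficulty[ceiling] - d_below        # D[c] - D[c-1]
    elif p_below < 0.50:
        spacing = d_below - mean_difficulty[ceiling - 2]    # D[c-1] - D[c-2]
    else:
        spacing = 0.0  # multiplied by zero, so its value does not matter
    return d_below + spacing * (p_below - 0.50)


# Illustrative strata spaced .65 units apart, with the ceiling at stratum 8.
means = {6: 0.65, 7: 1.30, 8: 1.95}
print(score7(means, ceiling=8, p_below=0.75))  # 1.30 + .65 * .25
print(score7(means, ceiling=8, p_below=0.25))  # 1.30 - .65 * .25
```

A proportion correct above .50 at the stratum below the ceiling moves the estimate toward the ceiling stratum, and a proportion below .50 moves it toward the next easier stratum.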

Finally, three average difficulty scores were defined. Score 8 was
defined as the average difficulty of all items answered correctly and was
calculable in all cases. Score 9 was defined as the average difficulty
of items correct between, but not including, the ceiling and basal strata.
The hypothesized advantage of this score over score 8 was that it would be
less susceptible to bias from inappropriate entry points. This score could
be determined except when no items were answered correctly between the
ceiling and basal strata, a condition caused by the ceiling and basal
strata being adjacent. When this occurred, score 9 was not calculated.
Score 10 is the average difficulty of all items answered correctly at the
highest non-chance stratum and was calculable except when the ceiling
stratum was stratum 1, a condition not encountered in this study.

Consistency scores. Weiss (1973) suggested that the consistency of the
response record, or variability of difficulties of items encountered by a
given testee, might yield information about the confidence which could be
placed on the point ability estimates obtained from the first ten scores.
Specifically, he said (p. 26), "Individuals who are more consistent should
have more stable ability estimates, while those who are less consistent
should have less stable ability estimates." This hypothesis was studied
using five scores designed to reflect response consistency.



Two consistency indices reflect the overall variability of the
difficulties of the items administered to a given testee. Score 11 is
defined as the standard deviation of the difficulties of all items administered.
Except for abnormal terminations, this score was always calculable. Score
12 is defined as the standard deviation of item difficulties of all items
answered correctly. In this study this score also was available for all
testees.

In an attempt to control for inappropriate entry points, three indices
reflect consistency using an individual's ceiling and basal strata. Score
13 is defined as the standard deviation of difficulties of all items
answered correctly between the ceiling and basal strata. This score could
not be calculated for a given testee when less than two items were answered
correctly between the ceiling and basal strata, a condition always caused
by the ceiling and basal strata being adjacent. Score 14 is defined as
the difference in average stratum difficulties of the ceiling and basal
strata. Scores 14 and 15 have an advantage over score 13 in that they are
calculable for all testees.
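
Score 8 and the two overall consistency indices (scores 11 and 12) can be computed from a response record as in the minimal Python sketch below. The record format, a list of hypothetical (difficulty, correct) pairs, and the function names are assumptions introduced here. The text does not say whether the standard deviations used N or N-1 in the denominator; the sketch assumes the population form (`statistics.pstdev`).

```python
from statistics import mean, pstdev


def score8(items):
    """Average difficulty of all items answered correctly."""
    return mean([b for b, correct in items if correct])


def score11(items):
    """Standard deviation of the difficulties of all items administered
    (population form; an assumption of this sketch)."""
    return pstdev([b for b, _ in items])


def score12(items):
    """Standard deviation of the difficulties of items answered correctly
    (population form; an assumption of this sketch)."""
    return pstdev([b for b, correct in items if correct])


# Hypothetical record of three items: (normal ogive difficulty, answered correctly).
items = [(0.0, True), (1.0, False), (2.0, True)]
print(score8(items), score11(items), score12(items))
```

Scores 9, 10, and 13 would apply the same computations to subsets of the record selected by the ceiling and basal strata.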



A Sample Stradaptive Response Record



Figure 2 shows the stradaptive test performance of a college sophomore.
This test record is typical of the stradaptive test performance of college
students. The testee was first presented with an entry point screen
(Figure 1) and indicated that his cumulative grade-point average to date
was between 2.76 and 3.00. He thus began the stradaptive test at stratum 5.
His answer to the first item was correct (indicated by a + in Figure 2),
which branched him to the first available item in stratum 6. Correct
answers to the second and third items resulted in his moving to stratum 8,
where he received the first item from that more difficult peaked test. Since
the stage 4 item was too difficult for him, his response was incorrect (-),
and he branched downward to the second item in stratum 7. The student then
alternated between correct and incorrect responses for the items at stages 5
through 8, followed by an incorrect response to the stage 9 item. This
returned him to stratum 6 for his tenth item. With a few minor deviations, he
then essentially alternated between correct and incorrect responses from
stages 11 through 20. Item 20 terminated the stradaptive test since the
testing procedure had at that point located the student's ceiling stratum:
at stratum 8 he had answered all five items incorrectly.
The Conventional Test

As in previous studies in this series (Betz and Weiss, 1973, 1974;
Larkin and Weiss, 1974), a conventional test was administered by computer
to provide a comparison with the stradaptive tests. This was a 40-item
peaked test for which norming summary statistics are shown in Table 2. Item
statistics for this test are presented in Appendix Table A-2. According to
Betz and Weiss (1973), who constructed the test, "Items were selected from
the pool [of 369 items] that had difficulties closest to p=.55 and item-
total score biserial correlation coefficients closest to .45" (p. 15).
The score for the conventional test was the proportion of items answered
correctly by each testee.

Table 2

Summary Statistics for Item Parameters of
the 40-Item Conventional Test

Parameters                              Mean     S.D.     High      Low

Traditional
  p, proportion correct                 .537     .010     .661     .267
  r_bis, biserial with total score      .472     .078     .612     .296

Normal Ogive (a)
  b, difficulty                        -.188     .592    1.155    -.956
  a, discrimination                     .543     .112     .774     .310

(a) Estimated using formulas described by McBride and Weiss (1974, p. 24)

Subjects

Subjects providing the data for this study were college students. Some
sophomores were recruited from the psychology department's subject pool,
but the majority of the testees were juniors, seniors, and first-year graduate
students from courses in psychological statistics and measurement. To obtain
test-retest stability data, some subjects were tested and then retested
after an interval of from two to 11 weeks. Valid test-retest data were
collected on 180 testees for Stradaptive 1, 98 testees using Stradaptive 2,
and 194 testees using the conventional test. To obtain the best possible
distributional and intercorrelational data on the stradaptive tests, single
administration data were gathered from other facets of the research program's
general data collection, yielding initial test data on 476 students for
Stradaptive 1 and 113 students for Stradaptive 2.

Analyses 

Comparison of the Stradaptive Test with the Conventional Test

These analyses were designed to investigate the characteristics of the
various scores derived from the stradaptive test, in comparison to scores
derived from the conventional test. Of interest were the characteristics
of their score distributions, as well as interrelationships among scores
derived from the two testing strategies. In addition, the relative stability
of ability estimates derived from the two testing strategies was considered
important, as it was in previous studies (e.g., Betz and Weiss, 1974; Larkin
and Weiss, 1974). Score stability was viewed both as an indication of the
relative reliabilities of ability estimates derived from the two testing
strategies and as an indication of the practical utility of the ability
estimates for making longitudinal predictions.

Descriptive statistics. The mean, standard deviation, skewness, and
kurtosis were calculated for all score distributions obtained from Stradaptive
1, Stradaptive 2, and the conventional test. The underlying ability distribution
of the population sampled was not known. Therefore, scores derived from the
different testing strategies could not be evaluated on the basis of how well
they reflected the distribution of true ability. A normally distributed score
distribution is statistically convenient, however, and for this reason alone,
scores that depart radically from normality are undesirable.
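
The four descriptive statistics can be illustrated with a short sketch. The report does not say which estimators were used, so the moment-based definitions below are an assumption, as is the choice to express kurtosis so that a normal distribution scores zero (which matches the later practice of testing skew and kurtosis against zero). The function name `describe` is hypothetical.

```python
import math


def describe(scores):
    """Return (mean, SD, skewness, excess kurtosis) from population moments.

    Assumption: moment-based estimators with divisor n; the report does not
    state which formulas were actually used.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    sd = math.sqrt(m2)
    skew = m3 / m2 ** 1.5          # 0 for a symmetric distribution
    kurt = m4 / m2 ** 2 - 3.0      # 0 for a normal distribution
    return mean, sd, skew, kurt
```

Under this convention a flat (platykurtic) distribution yields a negative kurtosis value and a peaked one a positive value.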

Internal consistency. In practical testing applications, internal
consistency is calculated as an approximation to parallel forms reliability.
Ability tests are typically constructed to maximize internal consistency. In
inter-strategy comparison research, such as is reported here, the goal is to
equate internal consistency across strategies. Given a unidimensional trait,
the internal consistency is partly a function of the discriminating power
of the items (Gulliksen, 1950). Thus, testing strategies which are equated
for internal consistencies can then be meaningfully compared in terms of
stabilities, since all strategies will have equally good items.

Internal consistencies were calculated in this study for both the
conventional and adaptive tests. These data were then used as a basis for
statistically equating the discriminating power of the item pools to provide
a more realistic comparison of the test-retest stabilities of the two testing
strategies.

Calculation of the internal consistency reliability of an adaptive test
cannot use standard approaches because 1) all individuals do not encounter the
same items, and 2) those items they do encounter cannot be thought of as a
random sample from the total pool. Some adaptive testing strategies, such as
two-stage, allow internal consistency to be calculated on subgroups of items
(e.g., Betz and Weiss, 1973) but this is usually an underestimate due to
restriction of range in ability. Larkin and Weiss (1974) were able to calculate
internal consistency for a pyramidal test using a scoring technique that
predicted a testee's response to all those items of the test which he had not
actually encountered. They concluded, however, that the resulting internal
consistency was an overestimate due to the assumptions made by the scoring
technique.

A different approach to estimating the internal consistency of an
adaptive test was taken in this study. Gulliksen (1950) presents a formula
for calculating internal consistency reliability from item reliabilities
(i.e., the item-total correlation weighted by the item variance). The formula
(Equation 21, p. 378) is a variation of the Kuder-Richardson formula 20
(KR-20):

    r_{xx} = \frac{n}{n-1}\left[1 - \frac{\sum s_g^2}{\left(\sum r_{xg}\, s_g\right)^2}\right]        [1]

where r_{xx} = internal consistency of the test
      n      = number of items in the test
      s_g^2  = item variance = p(1-p), where p = proportion correct
and   r_{xg} = correlation of item response and total score.

This formula, as derived by Gulliksen, is strictly correct mathematically only
when it is used to calculate test reliability from the same sample of items
on which the item reliabilities were calculated, and hence offers no direct
advantage over the KR-20. But a reliability coefficient can be obtained by
assuming that item-total correlations and proportion correct data obtained in
the norming study (McBride and Weiss, 1974) are acceptable estimates of
the item-total correlations that would have been obtained if a representative
sample of the population of individuals had been given the items of interest.
As a further departure from standard usage of this formula, the biserial rather
than the point-biserial item-total correlation was used for the calculations.
Although the formula was derived in terms of point-biserial correlations,
the point-biserial is affected by the difficulty of the items, dropping
when items are very easy or very hard. Thus, the biserial correlation is
more appropriate for use in an adaptive test, since item difficulties (and,
therefore, item-total point-biserial correlations) will vary with ability levels.

In this study, the internal consistency coefficient was calculated for the
stradaptive tests as follows: Substituting norm group item parameters into
Gulliksen's formula, a reliability coefficient was calculated for each person's
set of items. This reliability was then inflated or deflated to a length of
29 items (the mean length of the stradaptive test) using the Spearman-Brown
formula. These coefficients were then averaged across all individuals
using an r to z transformation. This yielded an internal consistency
coefficient characteristic of a set of 29-item conventional tests assembled from
the stradaptive item pool with test difficulties distributed as a function of
the underlying ability.
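
The three steps just described (item reliability formula, Spearman-Brown adjustment to a common length, and averaging through the r to z transformation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in the study: the function names and the input layout (one pair of lists per testee, holding norm-group proportions correct and item-total correlations) are hypothetical.

```python
import math


def item_reliability_coefficient(p, r):
    """Equation 1: internal consistency from item parameters.

    p: norm-group proportions correct for the items one testee received
    r: norm-group item-total correlations (the study used biserials)
    """
    n = len(p)
    sum_item_var = sum(pi * (1 - pi) for pi in p)
    # Sum of item reliability indices r_xg * s_g, with s_g = sqrt(p(1-p)).
    sum_item_rel = sum(ri * math.sqrt(pi * (1 - pi)) for pi, ri in zip(p, r))
    return n / (n - 1) * (1 - sum_item_var / sum_item_rel ** 2)


def spearman_brown(reliability, k):
    """Reliability of a test whose length is changed by the factor k."""
    return k * reliability / (1 + (k - 1) * reliability)


def mean_via_r_to_z(coefficients):
    """Average correlations on Fisher's z scale, then convert back to r."""
    zs = [math.atanh(c) for c in coefficients]
    return math.tanh(sum(zs) / len(zs))


def average_internal_consistency(testee_items, target_length=29):
    """testee_items: list of (p, r) pairs, one pair of lists per testee."""
    adjusted = []
    for p, r in testee_items:
        rel = item_reliability_coefficient(p, r)
        adjusted.append(spearman_brown(rel, target_length / len(p)))
    return mean_via_r_to_z(adjusted)
```

Because each testee's coefficient is first brought to the same 29-item length, tests of different lengths contribute on a common footing before they are averaged.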

To determine the utility of this technique, it was tested empirically.
Internal consistency was calculated by Equation 1 for four subsets of 10,
20, 30 and 40 items from the conventional test. The coefficients obtained
were inflated to a length of 40 items and compared with each other and with
a Hoyt internal consistency reliability coefficient calculated on the total
conventional test.

Test-retest stability: the problems. Test-retest stability coefficients
were of prime interest in this study as a means of comparing the relative
precision and practical utility of scores resulting from the stradaptive
and conventional testing strategies. Unfortunately, the conventional test was
constructed to match the psychometric characteristics of a two-stage test and
its match with the stradaptive tests was something less than optimal. The first
problem encountered with the conventional test was the fact that it was longer
than the typical stradaptive test. The average length of Stradaptive 1
was 27.75 items on initial testing and 31.35 items on retest. Average
lengths for Stradaptive 2 were 25.38 and 26.61 when question mark responses
were not counted. When question mark responses were counted, those lengths
rose to 29.23 and 30.64. The 40-item conventional test had the clear
advantage with respect to test length.

Also in favor of the conventional test, with regard to estimating
test-retest reliabilities, was the fact that it had all forty initial test items
repeated on retest, thus inflating the test-retest correlation because of
memory effects. The existence of memory effects with these items was
demonstrated by Betz and Weiss (1973) and Larkin and Weiss (1974).

Working against the conventional test was the fact that its item
discriminations were lower than those of the stradaptive tests. The average
normal ogive discrimination for conventional test items was a=.543. The
average discriminations for all the items in the stradaptive pools were
a=.746 and a=.717 for Forms 1 and 2 respectively. However, since the
stradaptive item pool was constructed so that the most discriminating items
were administered first, the average discrimination of the items actually
administered was higher than the average discrimination of all items in
the item pool. The average discriminations of all items administered, each
item weighted by the number of times it was administered, were a=.841 and
a=.879 for Forms 1 and 2 respectively. This result clearly favored the
stradaptive tests.

The final inequity was that the stability of the stradaptive tests was
influenced to some degree by the use of initial ability estimates for entry
points. The initial ability estimate obtained prior to the first testing
was used on both the initial test and the retest. Therefore, as the test
length approached zero items, the stability approached unity. Although the
shortest test contained nine items, this factor still likely had some influence
on the test-retest stability of stradaptive scores.



Test-retest stability: the corrections. A length of 29 items was
taken as an approximate average of the lengths of the stradaptive tests.
This is eleven items shorter than the forty-item conventional test.

While the conventional test had all its items repeated on retest, the
stradaptive tests rarely had the initial item set repeated on retest. To
calculate the proportion of items encountered in both initial test and retest
for Stradaptive 1, the smaller number of items within a stratum, on test or
retest, was taken as the number of common items in that stratum. This number
was summed over all strata and all individuals, and was divided by one half
the total number of items administered on test and retest to all individuals
who took the Stradaptive 1 test. The proportion of common items for
Stradaptive 2 was calculated in the same way. But where the Stradaptive 1
calculation gave an exact figure for number of items encountered twice, the
Stradaptive 2 calculation yielded only an approximation. This was because
totals within strata for Stradaptive 2 did not include question-mark responses.
The proportion of items common on test and retest was .615 for Stradaptive 1
and .567 for Stradaptive 2.
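
The common-item calculation can be sketched as below. The data layout (for each testee, a list giving the number of items administered in each stratum) and the function name are assumptions made for illustration. The use of the smaller count rests on items within a stratum being presented in a fixed order, so that a testee who received fewer items in a stratum on one occasion saw a leading subset of the items seen on the other.

```python
def common_item_proportion(test_counts, retest_counts):
    """Proportion of items encountered on both test and retest.

    test_counts, retest_counts: one list per testee, each giving the number
    of items administered within each stratum (hypothetical layout).
    """
    common = 0
    administered = 0
    for first, second in zip(test_counts, retest_counts):
        # Within a stratum, the smaller count is the number of shared items.
        common += sum(min(a, b) for a, b in zip(first, second))
        administered += sum(first) + sum(second)
    # Divide by one half the total number of items given on both occasions.
    return common / (administered / 2)
```

For a single testee with stratum counts [0, 3, 5, 2] on test and [1, 4, 3, 2] on retest, eight items are shared out of an average test length of ten, giving a proportion of .8.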


Within the conventional test, memory effects and lengths were equated
simultaneously by preparing, from the original set of forty items, five
analogous test pairs. Each pair consisted of one randomly selected test of
twenty-nine items and a second test containing the remaining eleven items
and eighteen of the first test's twenty-nine items. This yielded five pairs
of twenty-nine item tests, each pair having 18, or 62%, of their items in
common, thus matching the average proportion of items in common on the
Stradaptive 1 retest. Items for one test in each pair were scored from
the initial test data and items for the other were scored from the retest
data. As an estimate of stability of such an analogous form, the mean (r to
z transformed) correlation between members of the five pairs was used.
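
One way to assemble such a pair is sketched below. The report states only that the first test was randomly selected; drawing the eighteen shared items at random as well is an assumption, and the function name and item numbering (0 through 39) are hypothetical.

```python
import random


def make_analogous_pair(n_items=40, form_length=29, rng=None):
    """Build two overlapping forms from one item set.

    Form A is a random selection of form_length items. Form B holds the
    items left out of form A plus enough items drawn from form A to reach
    the same length (assumption: the shared items are drawn at random).
    """
    rng = rng or random.Random()
    items = list(range(n_items))
    form_a = rng.sample(items, form_length)
    in_a = set(form_a)
    remaining = [i for i in items if i not in in_a]       # 11 items
    shared = rng.sample(form_a, form_length - len(remaining))  # 18 items
    form_b = remaining + shared
    return sorted(form_a), sorted(form_b)
```

With the defaults, the two forms share 18 of 29 items (62%). In the analysis described above, one form of each pair would be scored from initial test responses and the other from retest responses.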

A direct correction for the effects of differences in item discrimination
on test-retest reliability was not available. A correction was implemented,
however, based on the fact that item discrimination has an effect on internal
consistency reliability, which has an effect on validity. It was further
assumed that correlational validity is in some respects analogous to
test-retest reliability. Gulliksen (1950) provided a formula (Eq. 8-19, p. 83) for
calculating the necessary increase in test length to obtain a desired internal
consistency:

    K = \frac{R(1 - r)}{r(1 - R)}        [2]

where K = proportionate increase in length
      r = the original internal consistency
      R = the desired internal consistency.

He also provided a formula (Eq. 9-19, p. 98) to predict the change in validity
of one test in predicting another as a function of changes in the lengths of
both tests. In the case of stability coefficients where both tests are the
same and both lengthened the same amount, that formula becomes:

    R_{tt} = \frac{K\, r_{tt}}{1 + (K - 1)\, r_{xx}}        [3]

where R_{tt} = corrected test-retest correlation
      r_{tt} = original test-retest correlation
      r_{xx} = original internal consistency
      K      = proportionate increase in test length from the previous equation.

Equation 3 may be recognized as a variation of the Spearman-Brown formula.

To correct for unequal discriminations, the conventional test internal
consistency calculated using the norming parameter method described earlier
was substituted for the original internal consistency in Equation 2. The average
internal consistency of the stradaptive tests, the calculation of which was
described earlier, was substituted for the desired internal consistency.
From this, the proportionate increase in length of conventional test required
to compensate for different discriminations was calculated. Then the average
stability of the five pairs of analogous conventional tests was inflated using
Equation 3 to the value expected had either the tests been lengthened to compensate
or the discriminations been equivalent.
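
The two-step correction can be sketched numerically. The functions below are the standard Spearman-Brown relations, solved first for the length factor and then applied to a stability coefficient. The function names are hypothetical, and the stability of .85 in the usage lines is an invented value for illustration only; the internal consistencies of .881 and .938 are the 29-item figures reported for the conventional and stradaptive tests in the internal consistency results.

```python
def length_factor(original, desired):
    """Proportionate change in test length needed to move internal
    consistency from `original` to `desired`."""
    return desired * (1 - original) / (original * (1 - desired))


def corrected_stability(r_tt, r_xx, k):
    """Test-retest correlation expected when both administrations are
    lengthened by the factor k, given original internal consistency r_xx."""
    return k * r_tt / (1 + (k - 1) * r_xx)


# Illustrative use: .881 and .938 come from the reported internal
# consistencies; the stability of .85 is a made-up input.
k = length_factor(original=0.881, desired=0.938)
print(round(k, 2), round(corrected_stability(0.85, 0.881, k), 3))
```

When the stability coefficient equals the internal consistency, the second function reduces to the ordinary Spearman-Brown formula, which is one way to check an implementation.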

It should be noted that the presence of these many corrections precludes
the drawing of any strong conclusions from this study regarding test-retest
stability. Several stability coefficients were calculated, however. Both
test-retest product-moment correlations and eta coefficients were calculated
for the forty-item conventional test and the five pairs of analogous forms.

Finally, to assess the maximum inflation of the stradaptive stability
coefficient that could be caused by the initial ability estimates, partial
correlations between test and retest administrations of the two stradaptive
tests were calculated, with initial ability estimate partialled out. The partial
correlation is probably an underestimate of the stability coefficient that would
have been obtained had initial ability estimates actually been held constant.
The reason for this is that the initial ability estimate has both valid and
error variance associated with it, and both the valid as well as the error
variance are removed by the partialling procedure. This partialling problem
was discussed in detail by Meehl (1970). For purposes of comparison,
the initial ability estimates were also partialled out of the conventional
test stabilities. The correlation between conventional test score and initial
ability estimate can be construed as common variance due to the underlying
ability, and any reduction in conventional test-retest correlation reflects
how much the stradaptive reliabilities were artifactually deflated in the
partialling process.
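
The report does not give the formula used for partialling. A minimal sketch, assuming the usual first-order partial correlation, is shown below; the function and argument names are hypothetical.

```python
import math


def partial_correlation(r_test_retest, r_test_entry, r_retest_entry):
    """First-order partial correlation between test and retest scores
    with the initial ability estimate (entry point) held constant.

    Assumption: the standard first-order formula; the report does not
    state which computation was used.
    """
    numerator = r_test_retest - r_test_entry * r_retest_entry
    denominator = math.sqrt(
        (1 - r_test_entry ** 2) * (1 - r_retest_entry ** 2)
    )
    return numerator / denominator
```

If the initial ability estimate is uncorrelated with both administrations, the partial correlation equals the zero-order stability; the more strongly it correlates with both, the more the stability is reduced, whether that shared variance is valid or error.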

Both to make the stabilities more comparable and to observe the effect
of time on stability, testees were divided into subgroups according to the
length of the test-retest interval: 0-15, 16-30, 31-45, 46-60 and over 60
days. Product-moment stability coefficients were then calculated using
scores of those testees within each time group from both Stradaptive 1 and
conventional tests. Stradaptive 2 data were not included in this analysis
because virtually all test-retest intervals fell into one of the above
groups (thus precluding trend analysis) and too few to analyze meaningfully
fell into a time period overlapping with a period from the other two tests
(thus precluding analysis within comparable intervals).

Correlations between stradaptive and conventional tests. Stradaptive 1
scores were correlated with the forty-item conventional test scores for those
testees who completed both on the same occasion. This correlation was computed
to determine whether the stradaptive and conventional tests were measuring the
same ability. Stradaptive 2 scores were not correlated with the conventional
test score because no subjects were given both the Stradaptive 2 test and the
conventional test.

Further Analyses of the Stradaptive Tests

Intercorrelations among stradaptive scores. Intercorrelations among scores
on the stradaptive test were calculated for the initial administrations of
both stradaptive tests. This was done to provide a basis for reducing the
number of scoring methods. If several scores are to be calculated, they must
be sufficiently independent of each other in order to provide differential
information.

Utility of the stradaptive consistency scores in predicting stability.
The five consistency scores were proposed as predictors of stability of
the ability scores. To determine whether a consistency score functioned in
this manner, subjects were first divided into five groups on the basis
of that score on initial testing, and then within-group stability analyses
were performed. Specifically, Stradaptive 2 testees were first ranked on the
basis of a consistency score. This distribution was then divided into five
groups with approximately equal numbers of testees. Stradaptive 1 testees
were then grouped on the basis of cutting scores established in the
Stradaptive 2 division. Stradaptive 2 was chosen for the initial division
in order to provide a sufficient number of subjects in each group to allow
meaningful analysis, since the total number of subjects who completed
Stradaptive 2 was smaller. After division into subgroups, product-moment
test-retest coefficients were calculated within each of these groups,
ranging from a group of highly consistent testees to a group of highly
inconsistent testees.

This analysis was performed on only three of the consistency scores,
scores 11, 12, and 13. The scores analyzed were chosen because they were
all standard deviation scores and this allowed a direct comparison of
scores 11 and 12, the overall variability scores, with score 13, a
statistically similar score based on variability between ceiling and basal strata.

Stability of stradaptive test response records. Weiss (1973) suggested
that ability scores might be estimated from a testee's stradaptive test
response record using "subject characteristic curves". These curves are
analogous to "trace lines" and are based on a testee's obtained proportion
correct at each stratum. He suggested that analysis of these data to obtain
ability estimates might proceed along the lines of estimating normal ogive
item parameters. Such latent parameters were not estimated in this study.
But, to facilitate future research into the utility of such data, several
indices of the stability of the stradaptive test response record were computed.
Thus, the stability of stradaptive test length was determined from a
product-moment correlation between number of items administered on initial test and
on retest.

No common index existed for overall stability of the subject characteristic
curve data, as reflected in total number of items answered within strata or
proportions correct within strata. The form of the data, however (multiple
continuous predictors and criteria), suggested the canonical correlation
model. Thus, canonical correlations between test and retest data were
computed. Two canonical analyses were implemented, using as variables in
one analysis the number of items administered in each stratum and, in the
second analysis, the proportions correct at each stratum.

For this canonical analysis, there were usually several strata in
which no items were administered and thus proportions correct could not be
calculated. To remedy this, the proportions correct below the ceiling
stratum were set to zero. Zero was used rather than the chance level
because, in the stradaptive testing strategy using an up-one, down-one
branching strategy, unless the testee runs out of difficult strata, he gets
no items correct at his highest stratum.

A complete redundancy analysis (Stewart and Love, 1968; Weiss, 1972)
was performed on the canonical correlations. The redundancy index of
greatest interest here is the redundancy of the retest given the initial
test. This can be interpreted as the proportion of variance in the retest
data predictable from the initial test data. It is also interpretable as
the average squared multiple correlation of scores on each retest stratum
with all scores on initial test strata. This redundancy coefficient bears
some similarity to a test-retest reliability coefficient, but it expresses
the stability of characteristics of the response records on the stradaptive
test rather than merely the stability of summary scores, as does the traditional
test-retest reliability coefficient.

Relative difficulties of items producing different kinds of responses.
One objective of this study was to examine the effects of not penalizing
testees for honestly admitting they were not sure which multiple-choice
answer was correct. This comparison was possible since Stradaptive 1 was
designed to treat such a response as incorrect (thereby branching to a less
difficult item) while Stradaptive 2 treated the same response as "no
information" and presented another item at the same stratum.

In addition to the test-retest data showing the relative stabilities
of scores on the two forms of the stradaptive test, an analysis was done
to determine if the average difficulty of items answered with a question
mark was equal to a testee's ability, or more difficult than his ability,
and if more difficult, how much more difficult.

Score 8, the average difficulty of all items correct, was used as an
estimate of ability. The difficulty of each item administered to an
individual was deviated from that individual's ability level (operationalized
as Score 8). These deviated difficulties were grouped into difficulties
of correct, incorrect, and question mark response items, and then pooled
over all individuals for Stradaptive 1 and Stradaptive 2 administrations
separately. Both initial test and retest data were used. Means of these
deviated difficulties yielded the average distance from ability, in normal
ogive difficulty units, of items generating the various types of responses.
Standard deviations of the deviated difficulties were also computed.
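
The deviated-difficulty analysis can be sketched as follows. The record layout (for each testee, a list of item difficulty and response type pairs), the response labels, and the function name are hypothetical, and the use of the population standard deviation is an assumption, since the report does not specify the divisor.

```python
from collections import defaultdict
from statistics import mean, pstdev


def deviated_difficulty_summary(records):
    """Mean and SD of item difficulty minus ability, by response type.

    records: one list per testee of (difficulty, response) pairs, where
    response is "correct", "incorrect", or "?" (hypothetical layout).
    Ability is taken as Score 8, the mean difficulty of items correct.
    """
    pooled = defaultdict(list)
    for responses in records:
        correct = [b for b, kind in responses if kind == "correct"]
        if not correct:
            continue  # Score 8 is undefined without a correct response
        ability = mean(correct)
        for b, kind in responses:
            pooled[kind].append(b - ability)
    # Population SD is an assumption about the original computation.
    return {kind: (mean(devs), pstdev(devs)) for kind, devs in pooled.items()}
```

A positive mean for the "?" category would indicate that items answered with a question mark were, on average, more difficult than the testee's estimated ability.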

Test length vs. ability. Ability scores derived from stradaptive testing
were correlated with test length. This analysis was designed to determine
whether there were interactions of scoring methods with characteristics
of the item pool which resulted in different correlations of scores derived
from each method with the number of items required to reach the termination
criterion. A slight correlation was expected because the more discriminating
items available at the lower difficulty strata were expected to yield fewer
incorrect branchings and thus faster terminations.

RESULTS

Comparison of the Stradaptive and Conventional Tests

Descriptive Statistics

Descriptive statistics for the initial testing of Stradaptive 1,
Stradaptive 2, and the conventional tests are shown in Table 3. [...]
data are summarized in Appendix [...] and of consistency scores [...]
normal ogive scoring units and [...] stratum units, but its standard
deviation [was multiplied by .65, the width of a] stratum in normal ogive
difficulty units, to be comparable to the other scoring methods.

The ten Stradaptive 1 ability scores [...]. Most of these ability scores
had a significant positive skew and [...] significantly platykurtic. The
[...] consistency scores, scores 14 and 15, show higher means and [...]
than the standard deviation consistency scores, scores 11, 12, and 13.
[...] consistency scores were significantly positively skewed [...]
indices, scores 13, 14, and 15, ranged from normal to significantly
platykurtic.



Means of Stradaptive 2 ability scores [...] Stradaptive 1 [...] smaller.
All ability [...] although the values were [...] significant due to the
smaller [...]

                                 Table 3

        Characteristics of Score Distributions for Stradaptive 1,
      Stradaptive 2, and the Conventional Test on Initial Testing

                              Stradaptive 1

Score                        N      Mean      S.D.      Skew    Kurtosis
Ability Scores
   1                         ?     1.073     1.187     -.080     -.785*
   2                       476      .560     1.468      .402*    -.714*
   3                       476      .531     1.3?4      .3??     -.680*
   4                       476     1.019     1.148     -.306*    -.893*
   5                       476      .570     1.453      .350*    -.680*
   6                       476      .370     1.274      .172     -.797*
   7                       476      .440     1.241      .229*    -.157*
   8                       476     -.042     1.055      .324*    -.613*
   9                       420      .06?     1.122      .340*    -.507*
  10                       475      .339     1.270       ?       -.770*
Consistency Scores
  11                       476      .753      .1?6      .661*     .780*
  12                       476      .661      .211      .570*    1.322*
  13                       420      .380      .?19      .348*    -.605*
  14                       476     1.92?       ?        .569*    -.034
  15                       476     1.94?       ?        .?71*    -.007

                              Stradaptive 2

Score                        N      Mean      S.D.      Skew    Kurtosis
Ability Scores
   1                       113      .774     1.064      .305     -.527
   2                       113      .173     1.212      .537*     .571
   3                       112      .167     1.084      .636*    -.048
   4                       113      .748     1.047      .176     -.871
   5                       113      .188     1.179      .579*     .790
   6                       113     -.006     1.120      .355      .170
   7                       113      .085     1.077      .445      .233
   8                       113     -.356      .853      .442     -.376
   9                        94     -.241      .944      .622*     .126
  10                       112     -.004     1.077      .566*     .117
Consistency Scores
  11                       113      .752      .196      .?78*    1.524*
  12                       113      .667      .225      .406      .785
  13                        94      .389      .195      .274     -.491
  14                       113     1.815      .832      .523*     .113
  15                       113     1.788      .822      .538*     .171

                            Conventional Test

Score                        N      Mean      S.D.      Skew    Kurtosis
40-item                    194      .588      .209     -.110     -.945*
29-item analogous form(b)  194      .588      .21?     -.129     -.884*

(a) S.D. is multiplied by .65.
(b) All statistics for the 29-item analogous form are means of statistics
    calculated on five combinations of items.
*Significantly different from zero at p<.05.
Note: ? marks a digit or entry that is illegible in the source copy.



Score distributions were essentially the same for the forty-item
conventional test and the twenty-nine item analogous forms. Both distributions
were symmetric around means of .588 and both were significantly flat.

Because of differences in scoring methods, no direct comparisons of
means and standard deviations can be made between the conventional and
stradaptive tests. The noticeable positive skew of the stradaptive scores
was absent in the conventional test data. Platykurtosis was similar in
Stradaptive 1 and the conventional test, but was not evident in Stradaptive 2.

Internal Consistency 

Table 4 presents the internal consistency of the conventional test
calculated using the item reliability formula, Equation 1, from
subsets of 10, 20, 30 and 40 items. Also shown is the internal consistency
calculated from Hoyt's formula using all 40 items. Since the average length
of the stradaptive test was about 29 items, the estimated reliability for
a conventional test of 29 items (shown in row 2 of Table 4) is most relevant
for this study. No definite trend is apparent in the corrected coefficients
as the number of items used is increased from 10 to 40. The coefficients
do appear to be higher than those obtained from Hoyt's method, however.

                                 Table 4

              Internal Consistency of the Conventional Test
               as Estimated from Subsets of Items Using the
             Item Reliability Method, Compared with Internal
               Consistency Calculated Using Hoyt's Method

                                         Number of Items
Internal Consistency              10      20      30      40    Hoyt(a)
Of item samples                  .686    .833    .887    .911    .893
Corrected to 29 item length      .863    .879    .883    .881    .858
Corrected to 40 item length      .897    .909    .912    .911    .893

(a) Based on 40 items.
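
The second and third rows of Table 4 follow from the first row by the Spearman-Brown formula, which gives a quick check on the tabled values. The sketch below recomputes the corrected coefficients from the uncorrected ones; the variable and function names are hypothetical, and because the inputs are already rounded to three places the recomputed values can differ from the table by about .001.

```python
def spearman_brown(reliability, k):
    """Reliability of a test whose length is changed by the factor k."""
    return k * reliability / (1 + (k - 1) * reliability)


# First row of Table 4: internal consistency of each item subset.
subset_reliability = {10: 0.686, 20: 0.833, 30: 0.887, 40: 0.911}

for n_items, rel in subset_reliability.items():
    at_29 = spearman_brown(rel, 29 / n_items)
    at_40 = spearman_brown(rel, 40 / n_items)
    print(f"{n_items:2d} items: 29-item {at_29:.3f}, 40-item {at_40:.3f}")
```

The same function applied to the Hoyt coefficient of .893 with a length factor of 29/40 reproduces the tabled .858.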



Internal consistencies of the stradaptive tests calculated from the item
reliability formula are presented in Table 5. As can be seen, they are
substantially higher than those of the conventional test (as shown in Table 4),
a result due to the higher item discriminations in the stradaptive item pools.

                                 Table 5

                  Average Internal Consistencies of the
                   Stradaptive Tests (Unweighted r to z
                   Averages of Internal Consistencies)

Corrected Length       Stradaptive 1     Stradaptive 2     Average
29 items                   .935              .942            .938
40 items                   .952              .957            .954



Test-retest Stability

Test-retest stability coefficients for the stradaptive tests are presented
in Table 6. Conventional test coefficients are presented in Table 7. These
tables contain zero-order product-moment correlations between test and retest,
partial correlations between test and retest with initial ability estimates held
constant, and eta coefficients.

Zero-order stabilities. Based on the zero-order correlations shown in Table
6, the average difficulty scores, scores 8 and 9, were the most stable ability
scores on both forms of the stradaptive test. Scores 2 and 5, the (N+1)th item
and stratum scores, were the least stable scores on both forms. The remainder
of the ability scores fell between these.

These differences between stabilities of the stradaptive ability scores
appear to be a function of the amount of information used by the scores. The
average difficulty scores are highest because they make use of information
gained from all items administered. The (N+1)th item and stratum scores, on
the other hand, are least stable because they are heavily dependent on the response
to the last item. A correct response on the final item makes these scores two
stratum units higher (1.30 normal ogive difficulty units, or 20% of the score
range, based on score ranges of -3.25 to 3.25) than does an incorrect response
to the same item.

As Table 6 shows, stabilities of the consistency scores were much lower
than those of the ability scores. For the Stradaptive 1 data, the stabilities
of the overall variability scores, scores 11 and 12, were highest at .569 and
.496 respectively. The stabilities of the scores representing variability between
ceiling and basal strata, scores 13, 14 and 15, were too low to consider
those scores representative of a stable trait. This does not necessarily
imply, however, that the consistency scores will not be useful in other
ways, such as moderating test-retest reliability of the ability scores.

Table 6

Test-Retest Stabilities of Stradaptive Tests

                          Stradaptive 1 (N=170)

                   Zero-Order       Partial          Eta
Score              Correlation      Correlation      Coefficient

Ability Scores
   1                  --               --               --
   2                  --               --               --
   3                 .845             .833             .849
   4                 .833             .810             .833
   5                 .787             .773             .787
   6                 .841             .829             .841
   7                 .872             .861             .878
   8                 .920             .901             .920
   9                 .912             .902             .912
  10                 .842             .829             .851
Consistency Scores
  11                 .569             .577             .577
  12                 .496             .485             .569
  13                 .252             .234             .327
  14                 .328             .324             .398
  15                 .321             .318             .364

                          Stradaptive 2 (N=79)

                   Zero-Order       Partial          Eta
Score              Correlation      Correlation      Coefficient

Ability Scores
   1                  --               --               --
   2                  --               --               --
   3                 .741             .733             .741
   4                 .690             .642             .734*
   5                 .654             .625             .695*
   6                 .717             .708             .738
   7                 .741             .730             .778
   8                 .823             .789             .823
   9                 .811             .792             .811
  10                 .746             .737             .746
Consistency Scores
  11                 .510             .510             .553
  12                 .300             .298             .385
  13                 .110             .134             .337
  14                 .128             .171             .274
  15                 .120             .164             .152

Note: N's reported refer to the number of subjects with valid data.
Stabilities for some scoring methods are based on fewer subjects.
*Curvilinearity significant at p<.05
[-- marks an entry that is illegible in the source copy.]

Three zero-order correlations are shown in Table 7 for the total
group who completed the conventional test: 1) the test-retest coefficient
for the entire 40-item test; 2) the average test-retest coefficient for the
five analogous forms (correlations for the five pairs were .876, .873,
.871, .862, and .890); and 3) the average correlation corrected to equate
internal consistencies of the stradaptive and conventional tests. The
corrected value was obtained by inflating the test-retest correlation
using the method described above, from its value given a conventional test
with internal consistency of .881 (the internal consistency of a 29-item
test) to the value expected with a conventional test having an internal
consistency of .938 (the average internal consistency of the stradaptive
tests). The first two coefficients, while statistically more sound, are
psychometrically inadequate for comparison. The latter value, while
theoretically the most adequate of the three for comparison with the stability
of the stradaptive test, rests on many corrections and assumptions. It is,
therefore, only a rough estimate of the stability of a conventional test
psychometrically equivalent to the stradaptive tests except for strategy
of administration. As can be seen from Tables 6 and 7, the corrected
stability of the analogous form of the conventional test (r=.931) is slightly
higher than the highest stradaptive ability score stability (r=.920).

Table 7

Test-Retest Stabilities of the Conventional Test

                                            Testees with Initial Ability
                              Total         Estimates Available (N=81)
                              Group         Zero-Order       Partial
Form                          (N=194)       Correlation      Correlation

40-item                        .913            .900             .889
29-item analogous form         .874            .868             .856
29-item analogous form
  corrected for
  internal consistency         .931
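
The correction for internal consistency refers to a method described earlier in the report and not repeated in this section. As a hedged illustration only: the tabled values are numerically consistent with a simple proportional adjustment, in which the observed retest correlation is multiplied by the ratio of the target internal consistency to the observed one. That reading is an inference from the numbers, not a statement of the authors' procedure, and the function name below is hypothetical.

```python
def rescale_stability(r_retest, alpha_observed, alpha_target):
    """Scale a retest correlation by the ratio of two internal consistencies.

    Assumption: a proportional adjustment. The report's own method is
    described in an earlier section and may differ in detail.
    """
    return r_retest * alpha_target / alpha_observed

# Values quoted in the text: 29-item stability .874, internal
# consistencies .881 (conventional) and .938 (stradaptive average).
corrected = rescale_stability(0.874, alpha_observed=0.881, alpha_target=0.938)
print(f"corrected stability = {corrected:.3f}")
```

Under this reading the corrected coefficient can exceed the uncorrected 40-item coefficient, which is one reason the text treats it as a rough estimate.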
Partial correlations. Even though partial correlation analysis
removed valid as well as extraneous variance, the reduction in correlations
was slight. The average (r to z transformed) ability score stability
correlation dropped from .842 to .838 in the Stradaptive 1 data and from
.732 to .709 for Stradaptive 2 (Table 6). The reduction in conventional
test stabilities shown in Table 7 was equivalent to that found in the
stradaptive tests even though initial ability estimates were not able
to inflate the zero-order conventional test stabilities. This suggests
that the artifactual inflating effect of the initial ability estimates on
stradaptive test stabilities was negligible.
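
For readers who want the computation behind a first-order partial correlation, the standard formula is sketched below. The correlations of the initial ability estimate with test and retest scores are not reported in this section, so the two values of .30 used in the example are purely hypothetical placeholders.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# r_xy = .845 is a zero-order stability from Table 6; the two correlations
# with the initial ability estimate (.30) are hypothetical.
r_partial = partial_correlation(r_xy=0.845, r_xz=0.30, r_yz=0.30)
print(f"partial stability = {r_partial:.3f}")
```

When the controlled variable correlates only modestly with both scores, the partial coefficient stays close to the zero-order one, which is the pattern the paragraph above describes.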

Eta coefficients. No Stradaptive 1 score showed significant curvilinearity
in the relationship between the test and retest score distributions.
For the three Stradaptive 2 scores with significant curvilinearity, no low
order trends were apparent in the bivariate scatter plots. This suggests
that the relationship between stradaptive initial test and retest scores
is essentially linear.
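
The eta coefficient (correlation ratio) used to check for curvilinearity can be sketched as follows. The report does not say in this section how the independent variable was grouped, so the equal-width binning, the default of five bins, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict

def correlation_ratio(x, y, n_bins=5):
    """Eta of y on x, with x cut into equal-width bins (an assumed grouping)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0          # guard against a constant x
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        index = min(int((xi - lo) / width), n_bins - 1)
        groups[index].append(yi)
    grand_mean = sum(y) / len(y)
    ss_total = sum((yi - grand_mean) ** 2 for yi in y)
    if ss_total == 0:
        return 0.0
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return (ss_between / ss_total) ** 0.5

# A U-shaped relation: the linear correlation is zero, but eta is not.
x_demo = [-2, -1, 0, 1, 2]
y_demo = [4, 1, 0, 1, 4]
eta_demo = correlation_ratio(x_demo, y_demo)
print(f"eta = {eta_demo:.3f}")
```

Eta is always at least as large as the absolute product-moment correlation for the same grouping; a significance test of the difference between the two is what a curvilinearity test evaluates.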

Test-retest interval. Table 8 presents test-retest stability coefficients
as a function of the length of the test-retest interval. With the exception
of score 4, all Stradaptive 1 ability scores show monotonically decreasing
stability with increasing time interval. The greatest decreases are observed
for scores 2 and 5, the (N+1)th item and stratum scores. Scores 1, 3, 6
and 7 appear to be little affected by test-retest interval. Consistency
scores 14 and 15 show increasing stability over time.

The 29-item analogous form of the conventional test had a considerably
lower test-retest reliability (.828) than the best of the stradaptive
scoring methods (score 8, r=.932) in the shortest (31-45 day) time interval.
In the 61-79 day retest interval, the 29-item analogous form had a retest
correlation of .860 while the retest stability of stradaptive score 8 was
.848 and that of score 9 was .858. The 29-item analogous form corrected for
internal consistency had a retest correlation of .916 in the 61-79 day
interval, considerably higher than any of the stradaptive scores. Again,
however, the legitimacy of the numerous corrections involved in these data
must be taken into account.

Although the distribution of test-retest intervals did not allow inclusion
of Stradaptive 2 data in this analysis, an observation worthy of note is
that the total group Stradaptive 2 stabilities were uniformly lower than the
total group Stradaptive 1 stabilities (Table 6) even though the mean
Stradaptive 2 test-retest interval was much shorter (24.6 vs. 47.9 days).

Table 8

Test-retest Stability Correlations
for Stradaptive 1 and the Conventional Test
by Test-retest Interval

Stradaptive 1

                                     Test-Retest Interval
                        Total        31-45        46-60        61-79
                        Group        Days         Days         Days

No. of Testees           170(a)       --           84           18
No. of Days
  Mean                  47.870      40.333       51.311         --
  SD                     8.931       3.202        3.341         --
  Skew                   -.401       -.347         .449         .661
Ability Scores
   1                     .837        .842         .818         .802
   2                     .786        .829         .781         .591
   3                     .844        .850         .841         .825
   4                     .833        .857         .783         .853
   5                     .787        .828         .792         .550
   6                     .841        .851         .832         .825
   7                     .872        .885         .859         .858
   8                     .919        .932         .914         .848
   9                     .911        .923         .908         .858
  10                     .841        .853         .835         .816
Consistency Scores
  11                     .569         --           --           --
  12                     .496        .555         .457         .513
  13                     .251        .285         .178         .354
  14                     .327        .250         .330         .553
  15                     .321        .236         .331         .540

Conventional Test

                                     Test-Retest Interval
                        Total        31-45        46-60        61-79
                        Group        Days         Days         Days

No. of Testees           194(a)       28          130           35
No. of Days
  Mean                  53.567      41.750       53.469       64.743
  SD                     8.149       1.266        3.611        4.010
  Skew                   -.808       -.077         .073        1.855
40-item test             .913        .905         .924         .879
29-item analogous
  form(b)                .874        .828         .886         .860
29-item analogous
  form(b) corrected
  for internal
  consistency            .931        .882         .943         .916

Note: ... some scoring methods are based on fewer testees.
(a) Within group N's do not add to total group N because of a single case in
the 0-15 day interval.
(b) All statistics for the 29-item analogous form are means of statistics from
five combinations of items.
[-- marks an entry that is illegible in the source copy. For consistency
score 11, the interval entries .608, .587 and .503 are legible but their
column order is not.]

Correlations Between Stradaptive 1 and Conventional Test Scores

Table 9 shows product-moment correlations and eta coefficients of
40-item conventional test scores with Stradaptive 1 test scores. The highest
correlations, those of scores 8 and 9 with the conventional test, when
corrected for attenuation using test-retest stabilities were .94[?] and
.938. Three of the Stradaptive 1 scores show significant curvilinearity in
their relationship with the conventional test score. This is probably due to
different scalings of the scoring methods.
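
The correction for attenuation mentioned above follows the classical formula: the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes the total-group test-retest stabilities from Tables 6 and 7 serve as the reliability estimates; the report does not restate its exact inputs here, so the result may differ from the printed values in the third decimal. The function name is hypothetical.

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Classical correction for attenuation."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Score 8: correlation with the conventional test (.859), its stability
# (.920), and the 40-item conventional test stability (.913).
corrected_8 = disattenuate(r_xy=0.859, r_xx=0.920, r_yy=0.913)
print(f"score 8 corrected for attenuation = {corrected_8:.3f}")
```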



Table 9

Correlations of Stradaptive 1 Scores with
Conventional Test Scores
(N=201)

Score                   Correlation        Eta

Ability Scores
   1                       .782            .809
   2                       .768            .799*
   3                       .790            .794
   4                       .784            .795
   5                       .769            .773
   6                       .791            .798
   7                       .812            .820
   8                       .859            .880*
   9                       .860            .885*
  10                       .800            .825
Consistency Scores
  11                       .058            .303
  12                       .170            .355
  13                       .231            .346
  14                       .205            .531*
  15                       .200            .240

Note: The N reported refers to the maximum number of testees with valid
data. Correlations for some scoring methods are based on fewer testees.
*Curvilinearity significant at p<.05

Table 9 also shows correlations of the stradaptive consistency scores
with the conventional test scores. These correlations were uniformly low,
ranging from .058 for score 11 to .231 for score 13. One consistency
score showed a significant curvilinear relationship with the conventional
test score. These data suggest that consistency scores provide information
about testees which is not contained in ability scores derived from the
conventional test.

Further Analyses of the Stradaptive Tests

Intercorrelations among Scores

Product-moment intercorrelations among the fifteen stradaptive scores
are presented in the lower triangles of Table 10. In the Stradaptive 1
data, it is apparent that all the ability scores correlated highly among
themselves (r=[illegible] to .532). Closer inspection of the ability score
intercorrelations revealed four relatively distinct clusters. The three item
difficulty scores, scores 1, 2, and 3, formed three two-variable clusters, each
with their respective stratum difficulty scores, 4, 5, and 6. In addition,
scores 3 and 6, the highest non-chance item and stratum scores, formed a
tight cluster with scores 7 and 10, the interpolated stratum difficulty and
average highest non-chance item scores. The common feature tying these latter
four scores together seems to be their close reliance on the ceiling stratum.
Only these four ability scores make explicit use of the ceiling stratum via
the functionally related highest non-chance stratum. Score 9 also is dependent
on the ceiling stratum, but its further dependence on the basal stratum,
and the fact that it is an average difficulty score, apparently lower its
relationship with this cluster.

The final ability score cluster was composed of scores 8 and 9. The bond
between these two scores appears to be that they are both average difficulty
scores; score 8 is the overall average difficulty of all items correct and score
9 is the average difficulty of all items correct between the ceiling and
basal strata.

Clustering among the consistency scores was even more obvious. Scores
11 and 12, the overall variability scores, formed one distinct cluster and
scores 13, 14 and 15, reflecting the testee's variability between ceiling and
basal strata, formed another.

Eta coefficients are shown in the upper triangles of Table 10. Although
eta is an asymmetric statistic, to conserve computer time etas were calculated
in one direction only, the rows being the independent variables. Due to the
large sample size, curvilinearity was significant in most cases. But the actual
differences between eta and r were small and the same clusters of scores emerged,
thus yielding the same conclusions as the product-moment correlations regarding
the similarities of scoring methods.

The intercorrelations and inter-etas for Stradaptive 2 were somewhat
smaller than those of Stradaptive 1, but the same pattern observed in the
Stradaptive 1 intercorrelations was apparent. Fewer significantly curvilinear
relationships were observed for Stradaptive 2, but this is due to the smaller
number of testees in the Stradaptive 2 analysis.

Utility of the Stradaptive Consistency Scores in Predicting Stability

Table 11 shows the test-retest correlations for scores on the stradaptive
and conventional tests as a function of consistency score intervals computed
from initial stradaptive test records. Retest correlations are shown separately
for: 1) consistency score 11, each testee's standard deviation of difficulties
of items encountered; 2) score 12, the standard deviation of difficulties of
items answered correctly; and 3) score 13, the standard deviation of
difficulties for items answered correctly between the ceiling and basal strata.

Table 11 shows a strong moderator effect on test-retest reliability for
consistency score 11 and, to a lesser extent, for score 12, with no general
moderator effect for score 13. For consistency score 11, the strongest
moderator effect was observed for ability score 1. On this score, the very
high consistency group (mean consistency score of .517) had a test-retest
correlation of r=.940. As consistency decreased, test-retest reliability
also decreased monotonically, with the very low consistency group (mean=1.038)
obtaining a test-retest reliability of only r=.652. Similar results were
obtained for the other ability scores using consistency score 11 as a
moderator variable. The potential utility of score 11 is shown by the
extremely high test-retest correlations for the very high consistency group
for ability scores 8 and 9 (r=.981 and .983, respectively). This indicates
that the retest ability scores of testees whose response records show
little variability on initial testing are almost perfectly predictable from
their initial ability test scores. Using these same ability scores and
score 11 as the consistency score, the very low consistency group is
considerably less predictable on retest (r=.869 and .889, respectively).

Table 11

Stradaptive 1 and Conventional Test Test-retest Correlations as a
Function of Initial Test Consistency Scores 11, 12 and 13

                                      Status on Consistency Score 11
                                    Very                               Very
                                    High     High    Average   Low     Low

Mean Consistency Score              .517     .625     .706     .815   1.038
Number of Testees in Interval(a)      27       30       41       43      29
Stradaptive Ability Score:   1      .940     .849     .847     .768    .652
                             2      .875     .721     .799     .778    .751
                             3      .956     .613     .878     .826    .708
                             4      .934     .840     .847     .7?1    .664
                             5      .896     .722     .793     .756    .741
                             6      .950     .798     .886     .820    .704
                             7      .970     .844     .902     .851    .758
                             8      .981     .927     .915     .853    .869
                             9      .983     .939     .907     .899    .889
                            10      .951      --      .882     .822    .718
Conventional Test                   .979      --      .918     .826    .878

                                      Status on Consistency Score 12
                                    Very                               Very
                                    High     High    Average   Low     Low

Mean Consistency Score              .379      --       --       --      --
Number of Testees in Interval(a)      30       39       40       27      34
Stradaptive Ability Score:   1      .892     .833     .909     .784    .724
                             2      .764     .778     .850     .823    .684
                             3      .913     .835     .900     .856    .697
                             4      .895     .813     .903     .715    .781
                             5      .783     .743     .870     .831    .670
                             6      .908     .827     .890     .867    .686
                             7      .943     .859     .921     .870    .737
                             8      .959     .920     .946     .841    .857
                             9      .968     .935     .926     .876    .883
                            10      .906     .823     .894     .858    .700
Conventional Test                   .962     .852     .952     .620    .904

                                      Status on Consistency Score 13
                                    Very                               Very
                                    High     High    Average   Low     Low

Mean Consistency Score              .119     .282     .376     .488    .670
Number of Testees in Interval(a)      34       17       29       30      31
Stradaptive Ability Score:   1      .853     .741     .804     .812    .825
                             2      .775     .755     .776     .876    .773
                             3      .746     .861     .871     .885    .810
                             4      .851     .800     .767     .801    .838
                             5      .758     .747     .780     .877    .812
                             6      .750     .873     .862     .893    .783
                             7      .790     .890     .906     .906    .822
                             8      .921     .930     .930     .924    .915
                             9      .892     .855     .9?7     .921    .912
                            10      .746     .?73     .857     .887    .808
Conventional Test                   .908     .765     .914     .926    .856

(a) Total number of testees with valid data. Retest reliabilities for some
scoring methods are based on fewer cases.
[-- marks an entry that is illegible in the source copy; ? marks an
illegible digit.]
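
The moderator analysis described in this section amounts to sorting testees by their initial consistency score, cutting them into intervals, and computing the test-retest correlation separately within each interval. The sketch below illustrates that idea under stated assumptions: the report's interval boundaries are not given here (its groups are of unequal size), so equal-sized groups are used instead, and the data are synthetic, generated only so the example runs. All function and variable names are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def moderated_stability(records, n_groups=5):
    """records: (consistency, test_score, retest_score) triples.

    Sorts by consistency score (a small within-testee SD means high
    consistency), cuts into n_groups roughly equal groups, and returns
    (mean consistency score, group size, test-retest r) for each group.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    size = math.ceil(len(ordered) / n_groups)
    results = []
    for start in range(0, len(ordered), size):
        group = ordered[start:start + size]
        consistency = [rec[0] for rec in group]
        r = pearson_r([rec[1] for rec in group], [rec[2] for rec in group])
        results.append((sum(consistency) / len(group), len(group), r))
    return results

# Synthetic illustration: retest error grows with the consistency score.
rng = random.Random(0)
records = []
for _ in range(200):
    consistency = rng.uniform(0.4, 1.1)
    ability = rng.gauss(0.0, 1.0)
    test = ability + rng.gauss(0.0, 0.3 * consistency)
    retest = ability + rng.gauss(0.0, consistency)
    records.append((consistency, test, retest))

results = moderated_stability(records)
for mean_consistency, n, r in results:
    print(f"mean consistency {mean_consistency:.3f}  n={n}  r={r:.3f}")
```

With subgroups of 20 to 40 testees, as in Table 11, the within-group correlations carry substantial sampling error, so reversals between adjacent groups should be read cautiously.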



When testees were subgrouped on initial consistency from stradaptive score
11 and retest reliabilities were computed using conventional test scores, the
results were not of the same pattern. Although the retest reliabilities
were highest for the very high consistency subgroup (r=.979), the very low
consistency group did not have the lowest reliability (r=.878). However, the
fact that the very high consistency group had a very high test-retest
reliability suggests some generality to the moderator effect for consistency
scores as measured by score 11.

Consistency score 12 showed a meaningful moderator effect for a number
of the ability scores. With the exception of ability scores 2, 4, 5 and 9,
the very high consistency group had the highest test-retest reliability, and
the very low consistency group had the lowest correlations. In no case,
however, were there the monotonically decreasing reliability coefficients
obtained for several scoring methods using consistency score 11 as the
moderator variable; for score 12, the average consistency group tended to
obtain a higher reliability than the high consistency group. For ability
score 9, the only deviation from the monotonic trend was the low consistency
group (r=.883). No general trend in stability correlations was
observed for the scores on the conventional test when moderated by
consistency score 12. Conventional score stabilities were, however, quite
high (r=.962) for the very high consistency subgroup, as was found for score 11.

Score 13 functioned very poorly as a moderator of test-retest stability.
For only two of the ability scores (1 and 4) was the test-retest correlation
highest in the very high consistency group. For the majority of the other
ability scores, and for the conventional test, stability correlations were
highest for the low consistency subgroup.

Appendix Table A-4 shows test-retest reliability correlations as a
function of consistency score intervals for Stradaptive 2. For this
variation of the stradaptive test, the predicted pattern of stabilities did not
occur for any of the consistency scores using any of the ability scores.
These results could be due to sampling fluctuations, or they could be due to
the differences between branching strategies used in Stradaptive 1 and 2.

Stability of the Stradaptive Test Response Records

Stability of within strata data. Table 12 presents the redundancy
analyses for total number of items answered within strata and proportion
correct within strata. The latter data were referred to as "subject
characteristic curves", possibly reflecting the scalability of the individual
with respect to a set of items.

Table 12

Redundancy Analysis for Number of Items Administered and
Proportions Correct Within Strata

Stradaptive 1

  Number of Items Administered within Strata
    Redundancy of retest given initial test          .414
    Redundancy of initial test given retest          .439

  Proportions Correct within Strata
    Redundancy of retest given initial test          .670
    Redundancy of initial test given retest          .668

Stradaptive 2

  Number of Items Administered within Strata
    Redundancy of retest given initial test          .319
    Redundancy of initial test given retest          .351

  Proportions Correct within Strata
    Redundancy of retest given initial test          .528
    Redundancy of initial test given retest          .471



For Stradaptive 1, 41.4% of the variance in number of items administered
within strata was predictable on retest from initial test data. The proportion
correct within strata was more predictable on retest, however. For these
data, 67% of the retest variance was predictable from initial test scores.
This result is equivalent to an average multiple correlation of about .82 in
predicting an individual's proportion correct within a stratum at retest
from his initial test data.
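
The step from a redundancy of .670 to a multiple correlation of about .82 is a square root, since redundancy is read here as a proportion of variance accounted for. A minimal sketch, with a hypothetical function name:

```python
import math

def redundancy_to_multiple_r(redundancy):
    """Treat redundancy as variance explained and return its square root."""
    return math.sqrt(redundancy)

# Retest-given-initial redundancies for Stradaptive 1 (Table 12).
r_items = redundancy_to_multiple_r(0.414)
r_proportions = redundancy_to_multiple_r(0.670)
print(f"items administered: {r_items:.2f}   proportions correct: {r_proportions:.2f}")
```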

For Stradaptive 2, redundancy for number of items administered within
strata was .319, while that for proportion correct was .528. These results
support earlier findings that scores on Stradaptive 2 are less reliable than
those on Stradaptive 1. However, they also support the finding with
Stradaptive 1 that the proportions correct data are likely to be more useful
than the number of items administered within strata.

Stability of total number administered. Test-retest correlations of
total number of items administered in the two stradaptive tests were r=.335
and r=.055 for Stradaptive 1 and 2, respectively. This finding partially
accounts for the fact that proportions correct were more stable than number of
items administered within strata.

Relative Difficulties of Items Producing Different Kinds of Responses

Table 13 gives means and standard deviations of scores in both normal
ogive difficulty units and stratum units for deviations of item difficulties
from ability as determined from score 8. These distributions are presented
as conditional on the type of response given (i.e., correct, incorrect or
question mark).

The average deviation of items answered correctly from score 8 was
0.0, but this is artifactual since score 8 was defined as the mean difficulty
of all items answered correctly. From the standard deviations in stratum
units, it can be seen that the difficulties of 95% of all items answered
correctly fell within 2.25 strata above or below the final ability estimate.
It is also apparent that items answered incorrectly were, on the average,
slightly more than one stratum more difficult than items answered correctly.
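
The 2.25-strata figure corresponds to a 95% band of plus or minus 1.96 standard deviations around the mean. The sketch below assumes the deviations are approximately normally distributed, which the report does not state explicitly; the standard deviations are those given in stratum units for correct responses in Table 13, and the variable names are hypothetical.

```python
Z_95 = 1.96  # two-sided 95% point of the standard normal distribution

# Standard deviations (stratum units) of deviations for correct responses.
sd_in_strata = {"Stradaptive 1": 1.146, "Stradaptive 2": 1.128}

half_widths = {form: Z_95 * sd for form, sd in sd_in_strata.items()}
for form, half_width in half_widths.items():
    print(f"{form}: 95% of correct items within +/- {half_width:.2f} strata")
```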

Table 13

Deviations of Item Difficulties from Score 8
for Three Types of Response

                       Stradaptive 1                     Stradaptive 2

                 Normal Ogive                      Normal Ogive
                 Difficulty       Stratum          Difficulty       Stratum
                 Units            Units            Units            Units
Response         Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.

Correct          .000    .745     .000   1.146     .000    .733     .000   1.128
Incorrect        .724    .751    1.114   1.155     .689    .730    1.060   1.123
Question Mark    .792    .775    1.218   1.192     .814    .791    1.252   1.217



Mean difficulties of items responded to with a question mark were
greater than mean difficulties of items answered incorrectly: .068 and
.125 units (or .11 and .19 strata) more difficult for the two stradaptive
forms, respectively. Since the stradaptive testing strategy attempts to adapt
the item difficulty to the ability of the testee, and since the question mark
responses appear to indicate that items responded to in that way are even more
difficult than those items answered incorrectly, it is obvious that the testee
should be branched to a less difficult item following a question mark response.
Discarding these item responses is an inefficient strategy for dealing with
question mark responses.

Test Length vs. Ability

Table 14 shows correlations of all scores with test length (number of
items administered to each testee at initial test) on both stradaptive forms.
Also shown is the correlation of conventional test score with Stradaptive 1
test length. All stradaptive ability scores except scores 8 and 9, the
average difficulty scores, correlated slightly with test length on both
forms of the stradaptive test. Consistency scores showed moderate to high
correlations with test length on both forms. Stradaptive scores 8 and 9,
and conventional test scores, had essentially zero correlations with test
length in all cases.



Table 14

Correlation of Stradaptive and Conventional Test Scores with
Number of Items Administered on the Stradaptive Test

                        Stradaptive 1      Stradaptive 2
Score                     (N=476)            (N=113)

Ability Scores
   1                        .263               .?79
   2                        .254               .302
   3                        .292               .399
   4                        .258               .381
   5                        .253               .298
   6                        .301               .?05
   7                        .241               .332
   8                       -.020               .094
   9                       -.046               .046
  10                        .293               .398
Consistency Scores
  11                        .449               .375
  12                        .455               .415
  13                        .727               .693
  14                        .787               .781
  15                        .788               .782

Conventional Test           .037

Note: N's shown are number of testees with valid data. Certain scoring
methods have fewer valid cases.
[? marks a digit that is illegible in the source copy.]

The most parsimonious explanation for the slight correlations of
ability scores with test length is that the less discriminating items at
the upper strata cause greater variation in the test record, which in turn
causes the test to take longer to satisfy the termination criterion. This
explanation is supported by the correlations of consistency scores with test
length. This explanation fails, however, to explain the zero correlations of
scores 8 and 9 with test length.

An alternative explanation is that test length is increased by
inconsistent response records (due to test-testee interaction) which have a
ceiling stratum more distant from actual ability level only because the range
of item difficulties encountered was greater. Average difficulty scores would
not be affected by this phenomenon, but maximum performance scores (e.g.,
scores 1 and 4) and scores dependent upon the ceiling stratum (e.g., scores
3, 6, 7 and 10) would be. This explanation is supported by 1) the data
showing positive correlations between test length and consistency scores
(especially the indices of distance between the ceiling and basal strata,
scores 14 and 15), and 2) the score intercorrelation data, which showed
moderate positive correlations between consistency scores and all ability
scores except scores 8 and 9, which correlated only slightly with consistency
scores. This explanation also suggests that there is an undesirable
interaction between scoring method and the termination criterion, rather
than any deficiencies in the item pool.

SUMMARY AND CONCLUSIONS

The most interesting distributional difference found in this study was
that Stradaptive 2 scores had lower means than comparable Stradaptive 1
scores. This was surprising because Stradaptive 2 allowed testees to skip
the more difficult items. Other distributional characteristics worthy of
note were the close distributional similarities between scores on the
40-item conventional test and scores on the five 29-item analogous pairs
of conventional tests, and the positive skew present in the stradaptive
tests but not in the conventional test.

Although the distribution of underlying ability is not known for this
college population, it is not unreasonable to assume that it is not normally
distributed. Rather, a college population is likely to be positively skewed
in verbal ability, since low ability testees would not qualify on entrance
examinations which are highly verbal. This suggests that the stradaptive
test better reflects the distribution of verbal ability in these testees
than does the conventional test. Furthermore, regardless of the distribution
of ability in the population, the positive skew in the stradaptive scores
suggests the capability of making finer discriminations among high ability
testees than does the conventional test. And it would be expected that a
college population would include a number of very high ability testees
whose scores would skew the distribution in a positive direction.

For inter-strategy comparisons it is important that internal consistencies
of all tests be equal. In this study, the stradaptive tests had more
discriminating items and thus higher internal consistencies.

Additional differences between the stradaptive and conventional tests
confounded the comparison of test-retest stabilities between tests. Differences
in lengths and memory effects were corrected for by creation of analogous
conventional test pairs matching the stradaptive tests on psychometric
characteristics. But the many corrections required limit the conclusions
that can be drawn regarding stability comparisons. The results were rather
inconclusive, with the corrected conventional test stability being slightly
higher than the best scores of Stradaptive 1.

Stradaptive 1 ability scores showed a decreasing trend in stability with
time while stabilities for the conventional test showed no trend with time.
However, the differences between time intervals were short and a longer
interval would be expected to show trends for both tests. The fact that the
stabilities of the Stradaptive 1 scores did change with time while those of
the conventional test did not may have been a function of the greater potential
for memory effects to inflate test-retest stabilities in the conventional
test, in which all 40 items were repeated on retest. The systematically
decreasing trend of test-retest stabilities has not been observed in other
empirical studies of adaptive tests (e.g., Betz and Weiss, 1974; Larkin and
Weiss, 1974).

In contrast to the few meaningful inter-strategy comparisons provided
by this study, much was learned about how to build a stradaptive test.
Intercorrelations between the stradaptive scores showed four relatively
distinct ability score clusters and two consistency score clusters. The
ability score clusters included: 1) maximum performance scores; 2) (N+1)th
item and stratum scores; 3) highest non-chance scores; and 4) average
difficulty scores. Average difficulty scores showed the highest stabilities
in this study. (N+1)th item and stratum scores showed the lowest stabilities,
a finding probably due to the number of strata available, which allowed
scores to change by 20% of the total score range on the basis of the
response to the last item administered. The moderate stabilities of such
chance-influenced scores as the highest difficulty scores are favorable
in that they show that the stradaptive test can contain an individual testee
within items in the range of his ability. That the scores dependent on the
ceiling stratum were only moderately stable may have been due to their
joint dependence on central tendency and variability.

The consistency scores clustered into overall variability and distance
between ceiling and basal strata scores. Score 11, an overall variability
score, functioned as a meaningful moderator of ability score stabilities for
Stradaptive 1. However, the stabilities of the consistency scores were
only moderately high, although there was a tendency for the stabilities to
increase with longer time intervals. Stradaptive 2 consistency scores were
not predictive of test-retest stability of ability scores, but these
results were given little weight because of the other erratic results
obtained from Stradaptive 2.

Stabilities of the total stradaptive response records, as assessed by
the redundancy analyses, were somewhat lower than the stabilities of the
best ability scores but higher than those of the consistency scores.
The proportions correct within strata were more stable than the numbers
administered within strata. The total number of items administered in the
stradaptive test was relatively unstable on retest.

The relative difficulties of correct, incorrect, and question mark
responses suggest that the question mark response is used by the testee not
when the testee is unsure of the correct response but rather when he has
no idea what the correct response is and prefers not to guess randomly.
Items which were responded to with a question mark were even more difficult,
on the average, than those which the testee answered incorrectly. This
result, when combined with the stability data for the two forms of stradaptive
tests, suggests that Stradaptive 1, in which question mark responses were
counted as incorrect responses, is a better testing strategy than
Stradaptive 2, in which question mark responses were ignored.

A slight correlation was obtained between test length and all ability
scores except the average difficulty scores. This was explained as resulting
from scores being defined as a joint function of central tendency and
variability, the latter of which causes test length to increase. To confirm
this hypothesis, a computer simulation of this phenomenon is required. If
test scores can be changed by increasing the variability of the response
record, which will probably be manipulated by changing the item discrimination,
all but the average difficulty scoring methods (i.e., scores 8 and 9) will
have to be discarded.

This study did not provide a clear answer to the question of interest:
whether a conventional or a stradaptive testing strategy provides better
measurement. A future study aimed at answering this question must control
extraneous variables such as item discrimination, memory effects, effects of
initial ability estimates, and test length.

Two interesting aspects of the stradaptive testing strategy were not
investigated in this study. First, the usefulness of an initial ability
estimate was not determined. Monte Carlo simulation methods would be
appropriate to determine how much information is added to a stradaptive test
by initial ability estimates when the ability estimates have differing degrees
of correlation with underlying ability. The other important aspect of the
stradaptive test that was not investigated in this study was the utility of
flexible termination. What must be determined in future study is what
variable, if any, is sufficiently related to error of measurement to provide
a basis for deciding when to terminate the stradaptive test.
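
One way to set up the Monte Carlo study suggested above is to generate initial
ability estimates that have a chosen correlation with underlying ability. The
following is a minimal sketch under stated assumptions, not a procedure used in
this study: ability is taken to be standard normal, and the function names,
sample sizes, and seeds are hypothetical.

```python
import math
import random


def simulate_initial_estimates(n_testees, rho, seed=0):
    """Return (abilities, estimates) with population correlation rho.

    Assumption: ability and error are independent standard normal
    variables, so estimate = rho * ability + sqrt(1 - rho^2) * error.
    """
    rng = random.Random(seed)
    abilities, estimates = [], []
    for _ in range(n_testees):
        theta = rng.gauss(0.0, 1.0)
        error = rng.gauss(0.0, 1.0)
        abilities.append(theta)
        estimates.append(rho * theta + math.sqrt(1.0 - rho ** 2) * error)
    return abilities, estimates


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Check that the generated estimates have roughly the intended validity.
    for rho in (0.3, 0.6, 0.9):
        abilities, estimates = simulate_initial_estimates(20000, rho, seed=1)
        print(rho, round(pearson(abilities, estimates), 3))
```

Each simulated estimate could then be used to choose the entry stratum of a
simulated stradaptive test, and the resulting scores compared across values of
rho.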

A final refinement of the stradaptive test that needs investigation
is the development of an optimal scoring strategy for the stradaptive test.
The average difficulty scores were the most stable in this study, but a
scoring technique based on a more adequate theoretical rationale might have
superior psychometric characteristics or more practical utility.

Table A-2
Conventional Test Item Parameters

                Traditional      Normal Ogive
                 p     r_bis       b       a
Mean           .537    .472     -.188    .543
S.D.           .101    .078      .592    .112
Maximum        .661    .612     1.155    .774
Minimum        .267    .296     -.956    .310

Items
               .661    .434     -.956     --
               .656    .543     -.739    .647
               .659    .490     -.835    .562
               .469    .572      .136    .697
               .646    .520     -.719    .609
               .646    .477     -.784    .543
               .651    .531     -.730    .627
               .640    .494     -.725    .568
               .634    .543     -.630    .647
               .634    .503     -.680    .582
               .623    .456     -.686    .512
               .558    .518     -.281    .606
               .608    .371     -.738     --
               .613    .320       --     .338
               .607    .516     -.525    .602
               .615    .315     -.927    .332
               .604    .427     -.617    .472
               .602    .538     -.480    .638
               .458    .612      .172    .774
               .458    .611     -.172    .772
               .557    .448     -.319    .501
               .559    .501     -.296    .579
               .559    .527     -.281    .620
               .549    .496     -.248    .571
               .542    .451     -.233    .505
               .539    .531     -.184    .627
               .542    .490     -.215    .5?2
               .529    .424      .171    .468
               .471    .385     -.189    .417
               .514    .448      .078     --
               .500    .519     -.001     --
               .506    .428      .035     --
               .449    .520      .246    .609
               .470    .400      .188    .436
               .463    .537      .173    .637
               .340    .359     1.151    .383
               .267    .538     1.155    .638
               .386    .296      .977    .310
               .335    .440      .968    .490
               .365    .353      .976    .377

Note. -- marks an entry, and ? a digit, that is missing or illegible in the
source copy.
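
The legible rows of this table agree with the standard conversion from
traditional item statistics to normal ogive parameters, b = z / r_bis and
a = r_bis / sqrt(1 - r_bis^2), where z is the normal deviate above which a
proportion p of the distribution lies. The sketch below illustrates that
relation. It assumes the columns are read as p, r_bis, b, and a, and that the
tabled values were obtained in this way; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def normal_ogive_parameters(p, r_bis):
    """Convert traditional item statistics to normal ogive parameters.

    p      proportion of testees answering the item correctly
    r_bis  biserial correlation between the item and ability
    Returns (b, a): item difficulty and item discrimination.
    """
    # Normal deviate with proportion p of the distribution above it.
    z = NormalDist().inv_cdf(1.0 - p)
    b = z / r_bis
    a = r_bis / math.sqrt(1.0 - r_bis ** 2)
    return b, a


if __name__ == "__main__":
    # Two rows of the table: (p, r_bis) with tabled (b, a) for comparison.
    for p, r_bis, tabled in [(0.469, 0.572, (0.136, 0.697)),
                             (0.656, 0.543, (-0.739, 0.647))]:
        b, a = normal_ogive_parameters(p, r_bis)
        print(round(b, 3), round(a, 3), "tabled:", tabled)
```

Small discrepancies in the third decimal place are to be expected, because the
tabled p and r_bis values are themselves rounded.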



Table A-3

Characteristics of Score Distributions for Stradaptive 1,
Stradaptive 2, and the Conventional Test on Retest

Stradaptive 1
Score                         N     Mean   S.D.(a)   Skew    Kurtosis
Ability Scores
   1                         180   1.250    1.172   -.334     -.635
   2                         180    .854    1.410    .107     -.760*
   3                         180    .793    1.286   -.066     -.822*
   4                         180   1.200    1.141   -.578*    -.539
   5                         180    .870    1.396    .010     -.671
   6                         180    .629    1.246   -.176     -.817*
   7                         180    .697    1.206   -.134     -.871*
   8                         180    .129    1.070    .017     -.785*
   9                         160    .276    1.?24    .068     -.833*
  10                         180    .607    1.241   -.154     -.846*
Consistency Scores
  11                         180    .746     .195    .630*     .232
  12                         180    .668     .228    .550*     .476
  13                         160    .398     .232    .421*    -.431
  14                         180     --      .932    .675*     .270
  15                         180     --      .921    .662*     .233

Stradaptive 2
Score                         N     Mean    S.D.     Skew    Kurtosis
Ability Scores
   1                          98    .809    1.126    .350     -.789
   2                          98    .324    1.283    .418     -.162
   3                          98    .3?2    1.215    .436     -.504
   4                          --    .?67    1.094    .274    -1.015*
   5                          98    .315    1.261    .441       --
   6                          98    .171    1.205    .383     -.632
   7                          98    .256    1.153    .479     -.539
   8                          98   -.289     .956    .604*    -.282
   9                          83   -.137     .996    .667*     .235
  10                          97    .114    1.187    .359     -.586
Consistency Scores
  11                          98    .761     .179    .407      .339
  12                          98    .677     .202    .291     -.269
  13                          83    .383     .212   -.031    -1.308*
  14                          98   1.898     .860    .217     -.933
  15                          9?   1.918     .849    .207     -.914

Conventional Test
Score                         N     Mean   S.D.(a)   Skew    Kurtosis
40-item                      194    .619     .211   -.195     -.796*
29-item analogous form(b)    194    .620     .213   -.166     -.870*

(a) S.D. is multiplied by .65.
(b) All statistics for the 29-item analogous form are means of statistics
    calculated on five combinations of items.
*Significantly different from zero at p<.05.
Note. -- marks an entry, and ? a digit, that is missing or illegible in the
source copy.
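
The starred skew and kurtosis entries in Table A-3 are consistent with a
large-sample z test that uses standard errors of sqrt(6/N) for skew and
sqrt(24/N) for kurtosis. The source does not state which test was applied, so
the sketch below is an assumption offered only as a way to check a tabled
value; the function name is hypothetical.

```python
import math


def differs_from_zero(statistic, n, kind, z_crit=1.96):
    """Two-tailed large-sample test of a skew or kurtosis value against zero.

    kind is "skew" (variance 6/n) or "kurtosis" (variance 24/n); these are
    the usual normal-theory approximations, assumed here rather than taken
    from the report.
    """
    variance = {"skew": 6.0, "kurtosis": 24.0}[kind] / n
    return abs(statistic) / math.sqrt(variance) > z_crit


if __name__ == "__main__":
    # Stradaptive 1, score 4: skew -.578 is starred, kurtosis -.539 is not.
    print(differs_from_zero(-0.578, 180, "skew"))
    print(differs_from_zero(-0.539, 180, "kurtosis"))
```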

Table A-4



Stradaptive 2 Test-retest Correlations as a Function of
Initial Test Consistency Scores 11, 12, and 13

                                    Status on Consistency Score 11
                                    Very                          Very
                                    High    High  Average   Low   Low
Mean Consistency Score              .524    .631    .704   .820  1.033
Number of Testees in Interval(a)     18      --      15     15     17
Stradaptive Ability Score:
   1                                 --      --     .747   .887   .598
   2                                 --      --     .535   .732   .588
   3                                 --      --     .680   .847   .603
   4                                .219     --     .733   .842   .592
   5                                 --      --     .558   .769   .586
   6                                .550    .823    .669   .858   .534
   7                                .592    .858    .698   .881   .563
   8                                .589    .876    .873   .912   .764
   9                                .846    .860    .786   .918   .662
  10                                .569    .829    .689   .856   .621

                                    Status on Consistency Score 12
                                    Very                          Very
                                    High    High  Average   Low   Low
Mean Consistency Score              .346    .552    .661   .765   .973
Number of Testees in Interval(a)     --      17      15     15     15
Stradaptive Ability Score:
   1                                .330    .769    .905   .867   .750
   2                                .581    .590    .827   .440   .665
   3                                .469     --     .910   .766   .665
   4                                .306    .723    .876   .844   .769
   5                                .575     --     .874   .647   .656
   6                                .449    .798    .915   .688   .666
   7                                .509    .819    .937   .700   .669
   8                                .500    .874    .944   .846   .805
   9                                .718    .809    .955   .785   .766
  10                                .471    .808    .916   .783   .679

                                    Status on Consistency Score 13
                                    Very                          Very
                                    High    High  Average   Low   Low
Mean Consistency Score              .117    .255    .379   .472   .665
Number of Testees in Interval(a)     13      14      13     13     14
Stradaptive Ability Score:
   1                                .590    .758     --    .850   .848
   2                                .652    .628     --    .731   .717
   3                                .773    .808    .?35   .817   .795
   4                                .555    .748    .783   .823   .876
   5                                .562    .669    .642   .802   .714
   6                                .774    .736    .787   .824   .801
   7                                .685    .747    .792   .870   .797
   8                                .732    .859    .807   .900   .903
   9                                .715    .868    .805   .958   .822
  10                                .771    .816    .827   .838   .799

(a) Total number of testees with valid data; some scoring methods are based
    on fewer cases.
Note. -- marks an entry, and ? a digit, that is missing or illegible in the
source copy.
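
The kind of moderator analysis summarized in Table A-4 can be sketched as
follows: testees are ordered on an initial-test consistency score, divided
into five intervals, and the test-retest correlation of an ability score is
computed within each interval. This is an illustration under stated
assumptions rather than the authors' program. The function names are
hypothetical, the intervals are cut to nearly equal size, and lower
consistency score values are assumed to indicate more consistent records, as
the column ordering of the table suggests.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def stability_by_consistency(consistency, test, retest, n_groups=5):
    """Test-retest correlation within intervals of the consistency score.

    Returns one (mean consistency, number of testees, correlation) tuple
    per interval, ordered from lowest to highest consistency score value.
    """
    order = sorted(range(len(consistency)), key=lambda i: consistency[i])
    results = []
    for g in range(n_groups):
        start = g * len(order) // n_groups
        stop = (g + 1) * len(order) // n_groups
        members = order[start:stop]
        mean_c = sum(consistency[i] for i in members) / len(members)
        r = pearson([test[i] for i in members],
                    [retest[i] for i in members])
        results.append((mean_c, len(members), r))
    return results


if __name__ == "__main__":
    # Made-up data for ten testees split into two intervals.
    consistency = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    test = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
    retest = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
    for row in stability_by_consistency(consistency, test, retest, n_groups=2):
        print(row)
```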